THE present century has seen extensive developments in methods of
measuring scientifically the traits and abilities of human beings.
Three main influences may be discovered behind this movement.
First, there is the tremendous interest which people have always
shown in judging one another's qualities. Much of their conversation
consists in summing up their acquaintances. Their plans, either for
work or play, are very largely guided by their notions regarding the
persons with whom they are about to deal. It is obvious that such
notions are frequently biased or inaccurate; and yet each one
continues to rely, in great affairs or small, chiefly on his own insight into
the characters of others. However, it is recognized that in certain
spheres of human activity more accurate and impartial assessments
are essential. The examination system, now firmly established in
schools, universities, and in many professions, is supposed to
measure the achievements and capacities of pupils, of students, and of
candidates for various jobs. Although this system is certainly an
advance on the earlier system of personal judgments, yet much
doubt exists as to its efficiency. A considerable section of this book
is therefore devoted to a discussion of the defects of examinations,
and of present-day methods of marking pupils' or students' work,
and to a study of how these may be improved.

Secondly, mankind is interested, not only in individuals, but also
in general problems of human nature and human institutions. Thus
another field where accurate measures of human qualities are needed
is the scientific investigation of mind, behaviour, and society. Loose
thinking and prejudice are extremely prevalent in matters of politics,
economics, religion, crime, and the like, as is shown, for example,
by Thouless in his book Straight and Crooked Thinking (1930). Yet
the social sciences, such as psychology and sociology, are
endeavouring to study these subjects rationally and impartially, in the same
manner as the physicist studies the atom, or the physiologist the
workings of the body. Now the data from which a science is built
up are very largely numerical and quantitative. The records of a
chemical or physical experiment usually consist of measures of
weights, times, distances, galvanometer or thermometer readings.
Measurement is almost equally important for purposes of precise
description and experiment in the social sciences; and the deduction
of sound conclusions from such quantitative data frequently involves
the application of statistical treatment. In this connection, then, we
shall consider below the nature of measurement in psychology, and
the elements of statistical technique, with especial reference to the
field of education.


A third source of the mental testing movement lies in the desire
for human betterment. Medical science has advanced enormously in
its ability to diagnose and treat bodily ills, but psychiatry and
psychology are only beginning to learn how to handle mental
abnormalities. Yet there are already, in addition to the mental hospitals,
numerous clinics where the diagnosis and treatment of emotionally
maladjusted patients (adults and children) are carried out. Special
schools, and tutorial classes in some ordinary schools, play their
part by providing a type of education suited to children who are too
greatly handicapped, mentally or physically, to make progress under
ordinary school conditions. Industrial schools and the Borstal system
aim at re-educating delinquents and budding criminals, so as to turn
them into useful members of society. The staffs of all these
institutions need accurate methods of measuring the educational and
occupational capacities of the “patients.” Actually many of our most
valuable mental tests were first constructed by workers in the fields
of educational, emotional, and vocational guidance.

This book is thus intended primarily for three classes of students,
namely teachers and examiners, psychologists and others who set
out to investigate human phenomena from the scientific standpoint,
and doctors and psychologists who wish to make practical use of
tests in clinics and schools. For reasons of space, no attempt is made
to describe the various mental tests in detail, but a classified list of
those used in this country is included in Chap. IX. Many authors
have emphasized the value of tests without sufficiently drawing
attention to their dangers and difficulties. Thus the main object of
this book is to outline the principles of test construction and
application or procedure, and the interpretation of the results of tests and
examinations, ignorance of which on the part of testers has
frequently led to serious errors. Mental measurement undoubtedly needs
a high degree of skill, which few of the persons just mentioned have
either the time or the opportunity to acquire thoroughly, before they
set out to test or examine.
There are also available many excellent textbooks on statistical
methods, perhaps the most useful to testers being those of Garrett
(1953), Dawson (1933), and Guilford (1936). Yet in the present
writer's experience, all of these are too difficult for the majority of
students whose psychological training only amounts to between
seventy-five and one hundred and fifty hours. This book aims,
therefore, to describe the minimum essentials, to show their
immediate application to psychological (and especially to educational)
problems, and to stress their general significance rather than their
technical details.

More advanced students are unlikely to find within much that is
original. But, so far as the writer knows, no previous account of the
statistics of marking, nor of the construction of new-type
examinations, has been published by a British author. The analysis of
educational measurement and the discussion of the technique of examining
are partly new but owe a great deal to Hamilton's brilliant book,
The Art of Interrogation (1929). Possibly also some of the practical
hints about tests and testing have not been published before.

Lack of space has necessitated some important omissions, such as
the whole field of temperament and character assessment, the testing
of special (vocational) aptitudes, and of sensory-motor capacities.
Actually few of the tests falling under these headings are well
standardized, and fewer still can safely be applied by those who are not
trained psychologists.
PREFACE TO THE SECOND EDITION

THE general principles of mental testing and elementary statistical
analysis have changed but little in the sixteen years since this book
was written. But the educational context in which they were
originally set has altered considerably, notably with the passing of the 1944
Education Act. This edition, then, brings the book up to date in
respect of selection for secondary education, though it is still, of
course, concerned only with the critical study of methods, not with
the selection problem as a whole. Next, there have been important
developments in the field of intelligence theory and intelligence
testing, and it is hoped that the views put forward here represent a
more reasonable picture of the relations of innate and acquired
abilities and of the findings of factorial analysis than has hitherto
been available to education and psychology students. The
comprehensive list of published mental tests has been completely
reconstructed. Several of the statistical sections have been revised and
slightly expanded to cover more of the problems which the examiner,
mental tester, or research worker are likely to meet, while avoiding
as far as possible any increase in complexity. Thus in general the
book covers the same ground as the original edition at much the
same length, but in a more up-to-date and more accurate manner.
IT is an unfortunate, yet indubitable, fact that the majority of
students of psychology and education find extreme difficulty in
grasping and applying the statistical concepts which are an essential
prerequisite to sound mental measurement. They take fright at the
mere mention of statistics or at the sight of statistical symbols and
formulae. A small minority, on the other hand, find in the
manipulations of numbers which we call statistics an intense fascination,
closely akin to the musical composer's fascination with interweavings
of sounds. Probably statistical ability is as specialized and as rare as
is ability in musical composition, and the expert statistician rather
easily loses touch with the doubts and difficulties of the beginner.
He fails to realize that the statistical way of looking at educational
or psychological phenomena, which comes so naturally to him, is
at first entirely foreign to the average student. Yet the student's
‘phobia’ is certainly unnecessary, since the elementary statistics
needed for the proper treatment of marks and test scores includes
little more than the kind of algebra and arithmetic which he mastered
by the age of 14. Statistics means nothing more than numerical data,
such as measurements, examination and test results; and statistical
methods are merely the ways of dealing with, or organizing, these
data so as to bring out their full significance.
The raison d'être of statistical methods is that human beings are
apt to differ so widely from one another in any and every measurable
characteristic, that results or conclusions which apply to one person
or one group of persons (e.g. one school class) do not necessarily apply
to another person or group. First, then, we must have efficient
techniques of arranging or tabulating the scores and marks obtained from
different people, in order to see what is the general run of their scores, and
what is the range or extent of the differences. These techniques are
described in the present chapter. The ‘general run’ and the ‘range or
extent’ of scores next require more precise definition and measurement;
that is to say, we must know how to work out averages and indices
of what is called spread, scatter, dispersion, or variability (Chap. III).
Not only do human beings differ from one another, but also each
individual varies in his capacities and characteristics from time to
time. The score he achieves to-day may or may not be typical of his
usual accomplishments. The same holds true of a class of pupils or
students. The extent of this untrustworthiness or unreliability (as it
is generally termed) of a single measurement or of a set of
measurements therefore needs to be determined. Statistical treatment will tell
us whether any alteration or variation in scores represents a real and
important change (such a change, for example, as we might hope to
bring about by means of the teaching given to the individual or to
the class) or whether it is merely a matter of chance fluctuations.
A further very common and important problem is the reality or
significance of a difference between two groups of people. For
instance, girls may appear to be better than boys at reading; but
this is not true of all girls and all boys; so we must have a method of
establishing how general and how reliable is the difference (Chap. V).
Still another type of variation requires special techniques of
investigation, namely, the variations in people's capacities in
different spheres. It is obvious that no one is even all-round in his level
of abilities. Yet we are apt to assume that high achievement in one
field implies high achievement in other fields. Sometimes also we infer
the opposite type of relationship: that if a person is strong in some
trait, then he will be lacking in some other trait. The statistical
methods of correlation enable us to say how far various traits or
abilities ‘go together’ or are related to one another. They also tell us
how far our tests or examinations measure what they are supposed
to measure, i.e. how valid they are. For example, a Certificate or
Matriculation examination at the age of 16-18 may, or may not,
validly predict success in a subsequent university career (Chaps.
VI-VII).
Although statistical methods are, then, of undoubted importance,
their limitations should also be recognized at this stage in our
discussion. They are tools for handling numerical data, but they are
not capable of improving data which may be initially inaccurate or
biased. When an investigator uses a poor test or examination, or
when he observes and records people's behaviour or mental processes
wrongly, no amount of statistical treatment of his results can correct
such deficiencies. Further, the fundamentally abstract nature of
statistical methods needs to be stressed. They are adapted primarily for
mass-investigations of measurements obtained from large numbers
of persons, and such investigations are apt to lose touch with the
concrete individual case. Without them we could not make sound
generalizations about human beings, but even with them we cannot
usually state whether a generalization applies to a particular human
being. The practical handling and understanding of an actual child
or adult is still an art rather than a science, though it may be greatly
assisted by scientific mental tests and guided by the results of
objective psychological experiments and statistical studies.

TABULATION OF MEASURES


Measures and Variables. By a variable we shall mean an ability
or some other characteristic in respect of which human beings vary
or differ from one another. Height, age, income, intelligence,
examination success, are all variables in this sense. A measure is the
numerical expression of a certain standing on, or amount of, a
variable. Fifty inches of height, five pounds a week income, and a
mark of 6 out of 10 are examples of measures.

Most variables are continuous; that is to say, they can vary by
infinitely small gradations. But we only measure them to the nearest
convenient unit. For example, heights may be measured to the
nearest tenth of an inch. But actually 50.1 inches includes all heights from
50.05 to 50.15; 50.2 inches includes from 50.15 to 50.25, and so on.
Similarly, a school mark of 6 is usually an approximation
representing all degrees of quality from 5½ to 6½. If, however, half-marks are
awarded, then 6 represents 5¾ to 6¼, and 6½ represents 6¼ to 6¾, and
so on. But some variables are discrete or discontinuous, since they
only vary by units. For instance, if the size of a class of pupils is 40,
this does not imply between 39½ and 40½ pupils. The scores for many
tests, and some school marks, are of this type, for they represent the
exact numbers of questions or test items correctly answered. Thus a
pupil would score 6 out of 10 for arithmetic sums even if he had
nearly completed a seventh sum. (Such measures are referred to in
the next chapter as count or enumeration scores.)

This distinction between continuous and discrete variables needs
to be borne in mind when tabulating measures (cf. p. 5).
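
The limits which a recorded continuous measure really covers can be worked out mechanically. What follows is only a minimal sketch in Python, assuming that the measure was recorded to the nearest unit; the helper name true_limits and the use of exact fractions are illustrative choices and form no part of the text.

```python
from fractions import Fraction

def true_limits(measure, unit):
    """Return the (lower, upper) limits covered by a continuous measure
    recorded to the nearest `unit`. Hypothetical helper for illustration."""
    m = Fraction(str(measure))
    half = Fraction(str(unit)) / 2
    return m - half, m + half

# A height recorded to the nearest tenth of an inch.
low, high = true_limits("50.1", "0.1")
print(float(low), float(high))      # 50.05 50.15

# A school mark awarded in whole marks, then in half-marks.
print(true_limits(6, 1))            # (Fraction(11, 2), Fraction(13, 2))
print(true_limits(6, "0.5"))        # (Fraction(23, 4), Fraction(25, 4))
```

A discrete measure, such as a class size of 40, has no such limits, and the helper should not be applied to it.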

Tabulation. Tabulation means arranging a set of measures in an
orderly fashion, so as to convey the essential facts about them in
convenient and relatively brief form. If we take a class of pupils, and
beside each pupil's name write his age, or exam. mark, we get a list,
but it is not a table, since the measures are quite haphazardly
arranged. It is very difficult to see from such a list, if it be a long one,
what is the general run and range of measures, or whether any
particular measure (e.g. John Smith's 6 out of 10) is high, low, or
average.
Ranking. One way of tabulating is to rearrange all the measures
in order of size from highest to lowest; that is, to put them in rank
order. The desired information can then be grasped much more
readily. But this is a laborious process if the number of pupils is
large.
Frequency Distributions. Under such circumstances it is better to
write in a column each possible measure, in order, and beside it the
frequency, F; that is, the number of persons who obtain each
measure. The result is called a frequency distribution, since it shows
how the various measures are allotted or distributed among the
persons.


Ex. 1. The following is a list of the marks (out of 20) which
were awarded to a class of 50 pupils. The table below shows their
frequency distribution. Marks: 10, 7, 5, 17, 15, 13,24, 6, 9, 15, 16, 1
20, 17, 19, 14, 9, 11, 17, 18, 8, 1, 9, 12, 19, 16, 15, 17, 14, 13, 10, 6, 
220,55, 18, 17, 13, 11,4, 16, 13, 8, 14, 18, 15, 12, 17, 16, 15, 17, 2. 
Hints on Drawing Up a Frequency Table. Do not look through
the original list and pick out all who got the highest measure, e.g. 20,
then all who got the next highest, 19, and so on. This would be a
waste of time and would easily lead to mistakes. Instead go through
the measures in the list consecutively, and as each one is reached, put
a tally or check mark opposite the appropriate measure in the
column. When any one measure recurs for the fifth time, draw a line
through the first four tallies. Repeat this on reaching the tenth,
fifteenth, etc., tally. Add up the tallies to obtain the F column. Add
up the F column itself to make sure that it checks with N, the total
number of persons.
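
The tally method amounts to a single pass through the list. It may be sketched in Python as below; the marks are hypothetical values chosen for illustration (they are not those of Ex. 1), and the helper names are likewise assumptions.

```python
from collections import Counter

def frequency_table(measures):
    """One pass through the list, as in the tally method: each measure
    adds one to the count (F) standing against its own row."""
    freq = Counter()
    for m in measures:
        freq[m] += 1
    return freq

def tally_marks(f):
    """Render a frequency as tallies, bundling every completed five."""
    bundles = ["||||/"] * (f // 5)
    rest = ["|" * (f % 5)] if f % 5 else []
    return " ".join(bundles + rest)

marks = [6, 7, 5, 6, 8, 6, 7, 9, 6, 7, 6, 5, 6]   # hypothetical marks
freq = frequency_table(marks)

for measure in sorted(freq, reverse=True):          # highest measure first
    print(f"{measure:>3}  {tally_marks(freq[measure]):<12} {freq[measure]}")

# The F column must add up to N, the total number of persons.
assert sum(freq.values()) == len(marks)
```

The final check corresponds to adding up the F column and comparing it with N.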

Distributions of Classified Measures. The above procedure is still
likely to produce an over-lengthy table, especially when N is large.
It would be absurd, for example, to tabulate the heights of all the
pupils of a school in this manner. Instead of listing the frequencies of
children 40.0, 40.1, 40.2 . . . inches tall, we would give the totals
with heights of 40, 41, 42 . . . inches, or the totals with heights of
40 upwards, 42 upwards, 44 upwards, and so on. That is, we would
group together sets of contiguous measures under a series of classes
or categories.

Considerable care is needed in deciding upon what is, or is not,
included in a class. When the variable is discrete (for example, test
scores ranging from 10 to 93) we may call our classes 10-14, 15-19,
20-24, etc., or else 10 +, 15 +, 20 +, etc. Both of these indicate
clearly that all scores of 10, 11, 12, 13, and 14 are included in the
first class, and that the size of each class is five units. The same
methods may be used for school marks. For instance, 10%-14%,
15%-19% . . . is unambiguous, although, strictly speaking, the
first class includes all shades of merit from 9½% to 14½%. But with
some continuous variables such as age or height, difficulties often
arise because the original measures are themselves only
approximations which represent classes of measures (cf. p. 3). Thus if height
has been measured to the nearest tenth of an inch, and is tabulated
to the nearest inch, we should avoid calling our classes 40, 41, 42
. . ., etc. For 40 might include 39.95 to 40.95, or 39.45 to 40.45, or
39.55 to 40.55. If we write 40 +, 41 +, etc., the first of these is usually
intended. With ages, 5 +, 6 + . . . is quite clear, since the 5 +
class includes all children who are at, or who have passed, their fifth
birthday, but have not yet reached their sixth.
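
For a discrete variable, the class to which any score belongs follows directly from the lowest class limit and the class size. A minimal sketch in Python, assuming classes of five units beginning at 10 as in the example of test scores given above; class_label is a hypothetical helper, not a standard function.

```python
def class_label(score, lowest=10, size=5):
    """Return the class (e.g. '10-14') which contains a discrete score."""
    if score < lowest:
        raise ValueError("score lies below the lowest class")
    start = lowest + ((score - lowest) // size) * size
    return f"{start}-{start + size - 1}"

print(class_label(10))   # 10-14
print(class_label(14))   # 10-14
print(class_label(15))   # 15-19
print(class_label(93))   # 90-94
```

With a continuous variable the same arithmetic would have to be applied to the true limits of the measures, not to the recorded figures.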
No set rules can be laid down as to the choice of classes, but the
following additional hints will apply to most of the tables which
teachers or psychologists will need.

(1) The number of different classes is generally somewhere between
[illegible] and 15. Some tables contain more, few less. When the total number
of persons, N, is only 50 or 100, a small number of classes is generally
sufficient; but if N is nearer 500, a more detailed table with a larger
number of classes may be desirable.

(2) Pick out from the original list of measures the highest and the
lowest measure to be tabulated, and deduce from them the total
range of the distribution. If the range is from 20 to 1, then 7 classes
containing 3 different measures each are likely to be appropriate. If
the range is from 137 to 63, then 13 classes of 5 are a possibility.
In most instances all the classes should contain the same number of
different measures, i.e. they should be of the same size.¹

(3) It is desirable, though not essential, to include an odd number
of different measures in each class. The reason for this is that in any
subsequent calculations we are going to regard all the measures
grouped within any one class as lying at the midpoint of that class.
If the classes include, say, 3-7, 8-12, 13-17, 18-22, etc., then we
shall count all the 3's, 4's, 5's, 6's, and 7's as 5's, and all from 8 to 12
as 10's. The midpoint of a class is found by taking the average of the
highest and lowest measures it can contain; e.g. (3 + 7) ÷ 2 = 5.
The same figure is the midpoint of a class whose true limits are 2½
to 7½. But a class such as 39.95 to 40.95 has the rather inconvenient
midpoint of 40.45.

(4) With school marks it is advisable to arrange for any round
numbers, such as the 5's and 10's, to fall at the centres of the classes.
Teachers are very apt to award these round numbers and to neglect
intermediate ones. Hence, if classes of 10-14, 15-19, 20-24 . . . are
chosen, most of the frequencies may actually lie at 10, 15, 20 . . .,
and it would be illegitimate to regard them as lying at the midpoints
12, 17, 22 . . . The difficulty will be surmounted if classes of
8-12, 13-17, 18-22 . . . are used instead.

(5) Having chosen the classes and written them out in a column,
tabulate the F's by the ‘tally’ method, as before, and check the
total, N.
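
Hints (3) to (5) may be drawn together in a short sketch. The Python below groups discrete marks into equal classes containing an odd number of measures, so that the round numbers fall at the midpoints; the marks are hypothetical and grouped_table is an assumed helper name, not an established routine.

```python
def grouped_table(measures, lowest, size):
    """Group discrete measures into equal classes of `size` units starting
    at `lowest`; return rows of (low, high, midpoint, F), highest first."""
    if min(measures) < lowest:
        raise ValueError("a measure lies below the lowest class")
    highest = max(measures)
    rows = []
    low = lowest
    while low <= highest:
        high = low + size - 1
        f = sum(1 for m in measures if low <= m <= high)
        rows.append((low, high, (low + high) / 2, f))
        low += size
    return rows[::-1]

# Hypothetical marks, bunched at the round numbers 5, 10, 15, 20.
marks = [3, 5, 8, 10, 10, 12, 13, 15, 15, 15, 17, 18, 20, 20, 22]
table = grouped_table(marks, lowest=3, size=5)
for low, high, mid, f in table:
    print(f"{low:>2}-{high:<2}  midpoint {mid:>4}  F = {f}")

# Check the total against N, as in hint (5).
assert sum(f for *_, f in table) == len(marks)
```

Starting the classes at 3 and not at 5 is what places 5, 10, 15 and 20 at the centres, in the manner recommended in hint (4).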


¹ If the distribution is very far from symmetrical, as in the case of incomes,
school attendances, and the like, a table with unequal-sized classes may better
portray the state of affairs. But no averaging or other statistical calculations can
be carried out with such a table. It may be analysed by the chi-squared technique
Ex. 2. Below is the distribution of the measures already listed
in Ex. 1. They have been grouped into classes of 3 marks each.

Classes        F
18-20          [illegible]
15-17          [illegible]
12-14          [illegible]
 9-11          7
 6-8           5
 3-5           [illegible]
 0-2           [illegible]

               N = 50

GRAPHICAL REPRESENTATION OF DISTRIBUTIONS
A distribution may often be portrayed more clearly by means of
a graph, either of the type known as a histogram, or by a frequency
polygon.¹ In both of these the frequencies are plotted on the vertical
(Y) axis against the measures, or classes of measures, on the
horizontal (X) axis. The scale of the graph and the location of the origin (the
point of intersection of the axes) are purely matters of convenience.

Fig. 1. Histogram of distribution from Ex. 2.

¹ The ogive type of graph, or cumulative frequency curve, will be omitted
here. The reader may find accounts of its construction and uses in more advanced
textbooks, such as Garrett (1953) or Dawson (1933).
Suppose, for example, that our graph paper measures 6 inches x 9
inches, each inch being divided into tenths, and that we wish to plot
the distribution given in Ex. 2. We might use a scale of 3 persons to
the inch on the Y axis, and one class (3 marks) to the inch on the X
axis. This has been done in Figs. 1-3. Fig. 4 portrays two
distributions of intelligence test scores which have been grouped into classes
73-77, 78-82 . . . 133-137. Each inch on the Y axis represents 5
persons, and each inch on the X axis includes 2 classes (10 points)
of score.

Fig. 2. Incomplete frequency polygon of distribution from Ex. 2.
A completed graph generally looks best if its largest vertical and
horizontal dimensions are roughly the same. Its appearance will be
too peaked if the scale of frequencies is too large, or the scale of
measures too small, and too flat if the converse holds.
The origin is often placed at X = 0, but there is no need for this.
Thus in Figs. 2-3, it is at X = -2, in order that the Y axis may
be drawn at the left-hand edge of the graph paper. And in Fig. 4 it
is at X = 100, in order that the Y axis may be in the centre of the
page.

The Histogram or Column Diagram. Fig. 1 is a histogram
portraying the distribution given in Ex. 2. It consists of a series of pillars
or columns drawn side by side. Each pillar represents one measure
or, as in this instance, one class of measures, and its height represents
the frequency of this measure or class. The marks from 0 to 20 are
written along the X axis, but note that each mark is represented by a
space of ⅓ of an inch on this axis, not by one of the inch or
tenth-inch lines. The first class (including 0, 1, and 2 marks) is, therefore,
represented by the whole width of the first inch, the second class (3,
4, and 5 marks) by the whole width of the second inch, and so on.

Fig. 3. Completed frequency polygon of distribution from Ex. 2.

The Frequency Polygon. All the measures included in a class are
here represented by a point on the graph, whose Y co-ordinate is the
frequency, and whose X co-ordinate is the midpoint of the class.
Thus, in Fig. 2, which represents the same distribution as before, the
midpoints are written along the X axis, and this time they are located
either at an inch or tenth-inch line. The points on the graph are
connected up by straight lines. Note that it is incorrect to smooth
the lines into a curve, like those in Figs. 10-14. The figure is, as it
were, suspended in mid-air, hence it is usual to complete it (Fig. 3)
by connecting the first and last points with additional points whose
Y co-ordinates are zero. The X co-ordinates of these additional
points are the midpoints of the classes just below the lowest, and just
above the highest classes, in the distribution, i.e. X = -2 and
X = 22.
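
The co-ordinates of a completed frequency polygon can be listed before any graph is drawn. A minimal sketch in Python follows, assuming equal classes of discrete marks; the frequencies are hypothetical values (not those of Ex. 2) and polygon_points is an assumed helper name.

```python
def polygon_points(lowest, size, frequencies):
    """Return the (X, Y) vertices of a completed frequency polygon.
    `frequencies` lists F for each class, lowest class first."""
    first_mid = lowest + (size - 1) / 2
    mids = [first_mid + i * size for i in range(len(frequencies))]
    points = list(zip(mids, frequencies))
    # Close the figure with zero-frequency points at the midpoints of the
    # classes just below the lowest and just above the highest class.
    return [(mids[0] - size, 0)] + points + [(mids[-1] + size, 0)]

freqs = [1, 3, 6, 9, 12, 14, 5]   # hypothetical F values, classes 0-2 ... 18-20
pts = polygon_points(lowest=0, size=3, frequencies=freqs)
print(pts[0], pts[-1])            # (-2.0, 0) (22.0, 0)
```

For classes of 3 marks running from 0-2 to 18-20 the closing points fall at X = -2 and X = 22, in agreement with the figures just given.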

Ex. 3. For further illustration there follow the frequency
distributions of scores obtained by 136 men and 114 women students on
the Cattell Intelligence Test, Scale IIIA. The scores are grouped into
classes of 5. The distributions for the two sexes are plotted as
frequency polygons on the same graph (Fig. 4), and for the sexes
combined as a histogram (Fig. 5). Note in the latter that the scale
along the Y axis has been halved, since combining the sexes roughly
doubles the frequencies to be plotted.

                      Frequencies
Test Scores     Men    Women    Combined
133 +             1        0           1
128 +             0        1           1
123 +             5        1           6
118 +             8        3          11
113 +            13        9          22
108 +            17        9          26
103 +            21       14          35
 98 +            17       30          47
 93 +            23       20          43
 88 +            14       13          27
 83 +             6       12          18
 78 +             8        2          10
 73 +             3        0           3
N =             136      114         250
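
The internal consistency of such a table is easily checked. The Python sketch below takes the frequencies of Ex. 3 as they are read here; the two entries for the class 128 + are inferred from the column totals and are therefore an assumption, not a certain reading of the table.

```python
# Frequencies of Ex. 3, highest class (133 +) first.
# The 128 + row (men 0, women 1) is inferred from the totals: an assumption.
classes = [133, 128, 123, 118, 113, 108, 103, 98, 93, 88, 83, 78, 73]
men     = [1, 0, 5, 8, 13, 17, 21, 17, 23, 14, 6, 8, 3]
women   = [0, 1, 1, 3, 9, 9, 14, 30, 20, 13, 12, 2, 0]

# Each combined frequency is the sum of the two sexes in that class.
combined = [m + w for m, w in zip(men, women)]
for lower, m, w, c in zip(classes, men, women, combined):
    print(f"{lower:>3} +  {m:>3} {w:>3} {c:>3}")

print(sum(men), sum(women), sum(combined))   # 136 114 250
```

Each column should add up to its own N, and the combined N should equal the sum of the other two.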


Relative Merits of Frequency Polygons and Histograms.
Frequency polygons possess the advantage over histograms that two or
more distributions can be superimposed and compared on one graph,
as in Fig. 4. But they are inferior in that the lines connecting
successive points give a somewhat inaccurate notion of the distribution,
especially when the number of points is small. For instance, in Fig. 4
the connecting line between the two points X = 110, F = 17 and
X = 115, F = 13 suggests that 15 men students obtained
intermediate scores of 112.5, which is quite untrue. In this respect the
histogram is unexceptionable, since it gives an exact portrayal of the
tabulated data. The total area enclosed within a histogram
corresponds accurately to the total number of persons, N. The area within
a polygon corresponds only approximately. Histograms have a further
advantage, namely, that they can be employed when the variable is
not a truly quantitative one. For example, examination scripts are
often graded, not with numerical, but with letter, marks: A, A -,
B +, etc. The frequency distribution of the numbers of examinees
who are awarded each letter can be shown by a histogram.
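
Both points can be put into figures. The Python sketch below reads a frequency off the straight line joining two polygon points, and then sums the column areas of a histogram; the two points are the men's classes 108-112 and 113-117 of Ex. 3 (midpoints 110 and 115), while the histogram frequencies are hypothetical and the function names are assumptions.

```python
def interpolate(p1, p2, x):
    """Frequency read off the straight line joining two polygon points."""
    (x1, f1), (x2, f2) = p1, p2
    return f1 + (f2 - f1) * (x - x1) / (x2 - x1)

# The line suggests a frequency of 15 at a score of 112.5, although
# no such intermediate class exists in the table.
print(interpolate((110, 17), (115, 13), 112.5))   # 15.0

def histogram_area(frequencies, class_size):
    """Total area of the columns: the sum of F times the class width."""
    return sum(f * class_size for f in frequencies)

freqs = [2, 5, 9, 4]                               # hypothetical, N = 20
print(histogram_area(freqs, 5) / 5)                # 20.0
```

Dividing the area by the class width recovers N exactly, which is the sense in which the histogram corresponds accurately to the total number of persons.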
Fig. 4. Frequency polygons of distributions from Ex. 3.
(Continuous line: Men. Broken line: Women.)

CHARACTERISTICS OF FREQUENCY DISTRIBUTIONS
Smoothness of Distributions. Here we must distinguish
provisionally between ‘objective’ and ‘subjective’ measures, and discuss
the difference more thoroughly in the next chapter. Objective
measures are those which are practically independent of the person
who makes them, such as heights measured with a ruler, or the
scores on a standardized educational or intelligence test. Subjective
measures, by contrast, depend upon personal evaluations. They
include, for example, the marks awarded by a teacher to a series of
English compositions, assessments of pupils' characters, and the like.

Fig. 5. Histogram of distribution of men's and women's scores
combined, from Ex. 3.

Fig. 6. Distribution of test scores of a small group.

Now when graphs are drawn of frequency distributions of
objective measures, certain characteristic features are very commonly
found. First, the graphs are generally very irregular when the
frequencies are small, but the irregularities tend to become ‘ironed out’
as the numbers increase. And when N reaches a value of several
hundreds, the frequency polygon or histogram approximates to a
smooth curve. Thus the distribution in Fig. 5 is much more regular
in its rise and fall than is the distribution in Fig. 6, which represents
the scores of a group of 50 students, selected at random from the
previous 250.

Fig. 7, which is a frequency polygon of the heights of 6,194
adults, gives an almost smooth curve.

Fig. 7. Frequency polygon of heights of 6,194 English male adults.

Fig. 8. Irregular distribution of test scores grouped in small classes.

Fig. 9. More regular distribution of test scores grouped in large classes.

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Frequencies may be increased not only by increasing the value of 
N, but also by grouping the measures into larger classes. Figs. 8-9 
show frequency polygons for (ће'зате set of 250 measures grouped e 
into classes of 3 and 10 respectively. The latter 1; distinctly smoother 
than the (огћег. E А 
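
The effect of class width can be sketched in Python. The scores below are simulated, not the 250 measures of Figs. 8-9, and the helper names `group_into_classes` and `roughness`, together with the count of direction changes used as an index of irregularity, are assumptions made purely for illustration.

```python
import random
from collections import Counter

def group_into_classes(scores, width, start):
    """Count the scores falling in each class of the given width, lowest class first."""
    counts = Counter((s - start) // width for s in scores)
    n_classes = max(counts) + 1
    return [counts.get(k, 0) for k in range(n_classes)]

def roughness(freqs):
    """Illustrative index: how often the frequency polygon changes direction."""
    diffs = [b - a for a, b in zip(freqs, freqs[1:]) if b != a]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if (d1 > 0) != (d2 > 0))

random.seed(1)
# 250 simulated scores, roughly normal about 100 and confined to 70-139 (assumed values).
scores = [min(max(round(random.gauss(100, 15)), 70), 139) for _ in range(250)]

small = group_into_classes(scores, width=3, start=70)
large = group_into_classes(scores, width=10, start=70)
print(len(small), "classes of 3, direction changes:", roughness(small))
print(len(large), "classes of 10, direction changes:", roughness(large))
```

A smooth single-peaked polygon changes direction once only; the more numerous the small classes, the more often chance fluctuations reverse the trend.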

Types of Distribution Curves.—In actual practice we can never obtain
measures of a large enough number of persons for the distribution
graphs to become completely smooth curves. Nevertheless, it is
customary to represent various types of distributions pictorially by
means of curves of appropriate shapes, to which obtained distributions
approximate more or less closely. For example, we can say that Fig. 10
represents roughly the type of distribution which might be expected
for the incomes of parents of elementary school children. The greatest
number would probably lie between £300 and £800 per annum, and the
frequencies of those at higher income levels would progressively
diminish. This particular type of distribution is called a skew curve,
because of its asymmetry. It is positively skewed towards the upper
end, and shows a piling up of cases at the lower end. A bunching at
the upper end would give a negatively skewed curve.

Fig. 10.—Skew curve, illustrating approximate distribution of annual
income of parents of elementary school children.
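
The direction of skew can be checked numerically. The sketch below uses the ordinary moment coefficient of skewness, which is not named in the text and is brought in here only as one convenient index; both sets of figures are invented for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes in pounds per annum: most cases low, a few very high.
incomes = [300, 350, 400, 400, 450, 500, 550, 600, 800, 1200, 2000]
# Hypothetical marks on an easy test: most cases high, a few very low.
easy_test_marks = [20, 45, 60, 70, 75, 78, 80, 80, 80, 80]

print("incomes positively skewed:", skewness(incomes) > 0)
print("easy test negatively skewed:", skewness(easy_test_marks) < 0)
```

A positive coefficient corresponds to piling up at the lower end with a tail towards the upper end, a negative one to bunching at the upper end.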

Fig. 11 is a bimodal curve such as might be obtained if we graphed
the intelligence quotients* of pupils in two schools, one containing
very intelligent children of professional class parents, the other
containing inferior children of very poor stock. Most of the I.Q.s
fall either between 110 and 140 or between 70 and 90, but there are a
few of even higher or even lower intelligence, and a few (probably
drawn from both schools) of intermediate or average intelligence,
with I.Q.s of 90 to 110.

* It is assumed that all readers know roughly what is meant by
intelligence quotients (I.Q.s). A definition may be found on p. 19,
and a more precise account of their meaning on p. 194.

The two polygons which appear in Fig. 4 are irregular on account of
the small frequencies involved. They approximate, however, to a pair
of curves such as those shown in Fig. 12. Note here that the more
highly peaked curve (representing women's scores) does not indicate
that women obtain, on the whole, higher scores than men, but that
women obtain more scores close to the average of 100 points, and fewer
extreme scores; in other words, that men are more dispersed or spread
out in their ability.

It should be remembered that the areas enclosed beneath distribution
curves are proportional to the total numbers of cases involved. In
Fig. 12 the areas are roughly equal, since the total numbers of the
two sexes are about the same. Fig. 13 shows the curves to which the
distributions of scores among Honours and among ordinary degree
students approximate. The latter have, on the whole, somewhat lower
scores than, but are four times as numerous as, the former. It is more
usual, when groups of such different sizes are being compared, to plot
percentage frequencies instead of actual frequencies, since the curves
will then possess the same area, and differences in the averages and
ranges or dispersions of measures will stand out better.

Fig. 11.—Bimodal curve, illustrating intelligence quotients of mixed
groups.
Fig. 12.—Two distributions of intelligence test scores, with different
dispersions.
Fig. 13.—Distributions of intelligence test scores of two groups, one
much larger than the other but of lower average intelligence.
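
The conversion to percentage frequencies mentioned above may be sketched as follows. The class frequencies are invented (they are not the data behind Fig. 13), and the function name `percentage_frequencies` is likewise only illustrative.

```python
def percentage_frequencies(freqs):
    """Convert actual class frequencies to percentages of the group total."""
    total = sum(freqs)
    return [100.0 * f / total for f in freqs]

# Hypothetical frequencies in the same six score classes, lowest class first.
honours = [2, 8, 20, 40, 20, 10]       # 100 students
ordinary = [40, 120, 160, 60, 16, 4]   # 400 students, four times as numerous

for name, freqs in (("Honours", honours), ("Ordinary", ordinary)):
    pct = percentage_frequencies(freqs)
    print(name, [round(p, 1) for p in pct], "total:", round(sum(pct), 1))
```

Each set of percentages adds up to 100, so that the two polygons enclose the same area and the difference in average level is no longer obscured by the difference in numbers.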
DISTRIBUTION
Whenever a large group of human beings is taken at random (i.e. not
specially selected like the two schools whose I.Q.s are graphed in
Fig. 11), then objective measures of almost any trait or ability they
possess—physical or mental—will be found to conform to a certain type
of distribution known as the normal curve. This is often referred to
as the normal probability curve, the normal curve of error, or the
Gaussian curve. It is symmetrical, bell- or cocked-hat-shaped. It
indicates that the majority of persons in the group obtain measures
round about the average, that relatively few obtain extreme measures,
and that the frequencies above the average are the same as the
frequencies below it. The four curves in Figs. 12-13 belong to this
type, and all the graphs from Fig. 4 to Fig. 9 roughly approximate to
it, since they represent distributions of scores obtained by
unselected groups of persons. Innumerable instances of distributions
which tend to this normal shape could be cited, e.g. the heights of
children in an ordinary school class, the numbers of times they can
tap with one finger in a minute, or the numbers of words they can
write. Variations in individual performances as well as differences
between individuals show the same trend: if a person tries a hundred
times to copy a line of a certain length, most of his attempts will be
close to his own average, and a few will be considerably too long or
too short. Differences between groups are also similar: if all the
elementary school classes aged 11+ in a large city are given an
objective arithmetic test, and the average mark of each is calculated,
then a graph of all these averages may be expected to show a normal
distribution.

Conditions which Upset the Tendency to Normality.—There are several
factors which may disturb this normal tendency. First, the total
frequencies involved may be so small that the irregularities, noted
above, may mask it. It is usually apparent, however, even with groups
of 50 or less (cf. Fig. 6). Secondly, the group of persons may be
selected in such a way as to distort an otherwise normally distributed
variable. For example, a school class which has been 'creamed' will be
likely to show a negatively skewed distribution of test scores or
examination marks, since there will be a deficiency of pupils of high
ability. The police force or the army will not show a normal
distribution of heights, if no recruits are accepted below a certain
height.

Thirdly, there are a few human characteristics which, although
objectively measurable, are known to be abnormally distributed.
Accident proneness is one such variable. It is found that a few people
are liable to undergo a large number of accidents in any given period
of time, whether at home, or in factories, or when driving cars. But
the majority have few or no accidents.1 Thus the distribution takes
roughly the form shown in Fig. 14. Annual income also gives a skew
distribution, and it is likely that other measures of 'productivity'
such as literary or scientific output are similar (cf. Burt, 1943). It
is hardly possible to prove that intelligence quotients conform to the
normal curve, though this is a convenient assumption that accords
reasonably with commonsense observation. It does break down at the
bottom end of the scale, since there are more imbeciles and idiots of
very low I.Q. in the population than would be expected. Some
investigators have claimed other marked divergences from the normal
shape, but these might be attributable to inequalities in the units of
measurement.

Fig. 14.—Approximate frequency distribution of accident proneness.

Such irregularities, or defects in our methods of measuring variables,
constitute a fourth condition, which is of very frequent occurrence,
as we shall see in the next chapter.

1 Cf. H. M. Vernon (1936).

We are all accustomed nowadays to regarding a scholastic capacity
(e.g. arithmetical ability), like height or weight, as ranging from a
very small amount in some people to a very large amount in others, and
to assessing these amounts in numerical units such as percentage
marks. General intelligence and other aptitudes can similarly be
graded as quantitative variables, average intelligence being generally
denoted by an Intelligence Quotient or I.Q. of 100, superior or
inferior intelligence by larger or smaller numbers. Yet the conception
of scaling mental qualities in a manner similar to physical ones is of
fairly recent origin—we owe it largely to the work of Sir Francis
Galton—and it is still somewhat limited in application and beset with
many difficulties. A human ability or trait is far more complex than a
length, a weight, or a time; it can never be completely represented by
a single numeral on a scale. Consider, as an example, political
opinions—a communist is certainly more socialistically inclined than a
liberal, and a fascist is more reactionary than a conservative.
Successful tests have, therefore, been devised which scale these
attitudes as a single variable ranging from extreme left-wing to
extreme right-wing. But such a scale can hardly do justice to all the
various shades of political opinion; it inevitably involves some
over-simplification and artificiality. Particularly in the field of
temperament and character does such distortion arise, for here people
appear to us to vary from one another, not merely in amount, but also
in kind or quality.1 The same is true, though to a lesser extent, in
the field of abilities. Thus the precise nature of intelligence is an
extremely controversial matter, and the common tendency to regard the
I.Q. as an accurate measure of some definite and stable mental entity
is, as we shall see below, an egregious misinterpretation.

1 For a fuller discussion of these points, cf. Vernon (1953).

No mental trait or ability can be measured directly by laying some
kind of ruler alongside it. Instead we have to grade it by observing
its expression in words or actions. Arithmetical ability, for example,
is shown by success at a series of sums; moral character by certain
modes of behaviour. Indeed, for scientific purposes, we have to regard
a trait or ability, not as some power or faculty in the mind, but as a
particular class or category of related performances, e.g. of the
arithmetical or the moral kind. And special statistical techniques are
required for deciding just what performances should be included under
one heading, as representative of one ability or trait (cf. Chap.
VIII). Since we cannot, of course, observe and record all the
performances which go to make up a mental characteristic, we are
forced to take samples of them. Our procedure, as Binet pointed out,
is analogous to that of the mining engineer who cannot examine all the
ground of a given locality, but instead takes borings at various
points, and from these samples deduces with reasonable accuracy what
the district as a whole is like. Similarly, the psychologist measures
intelligence, or the teacher measures achievement in arithmetic, on
the basis of the child's success at sample intellectual and
arithmetical tasks. An arithmetic examination sets, perhaps, six
questions, the answers to which constitute a sampling of the general
field of arithmetical knowledge which the pupil is supposed to have
acquired.
How, then, is success at such sample performances expressed in
numerical terms? The various methods of marking or scoring mental
tests and examinations may be classified as follows:
A. Enumeration or count scoring.
B. Ranking.
C. Qualitative grading and classifying.
D. Numerical evaluation.
E. Mixed numerical evaluation and count scoring.


A. ENUMERATION OR COUNT SCORING
This type of marking is the only objective type, in the sense that
each person's mark is quite independent of the marker's subjective
evaluation of his work. When, for example, a series of simple
arithmetic sums is set, the answers to which are all either right or
wrong, or when one mark is deducted for each mistake in a dictation
test, then a pupil's total mark is a count or enumeration score. Since
this method is applicable only when no doubt can exist as to the
correctness or incorrectness of the answers which are counted up, it
is largely restricted to very simple kinds of scholastic work. We
shall see later, however (Chap. XII), that attainments in more
advanced and complex school or university work can be fairly well
measured by sets of brief, objectively marked questions. The great
majority of standardized educational and general intelligence tests
follow this method, so as to eliminate the personal element in
scoring. Steel and Talman (1936) claim that even English compositions
can be marked by counting all the good and bad points of vocabulary,
sentence-structure, and sentence-linking which they contain. The
scoring is objective also among what are known as performance tests of
intelligence or of special aptitudes. The child is set some practical
or manipulative task, and a record is obtained, either of the number
of pieces successfully put together or of the number of seconds taken.

This type of marking includes two distinct methods of measurement
which may be termed the rate method and the power method.

A test or examination may contain a number of fairly simple tasks,
all of approximately uniform level of difficulty. The score is then
the number of these tasks accomplished within a certain time limit.
Alternatively, it may contain tasks which are graded in difficulty
from quite simple to very hard tasks, such that all persons being
tested can answer some of them, and scarcely any or none can manage
all of them. Here no time limit is necessary. The score is again based
on enumerating the tasks successfully completed, since those whose
ability, knowledge, or 'power' is greater are able to answer tasks of
a higher level of difficulty. These two methods have aptly been
compared to a hurdle race and a high-jump competition respectively.
The table of tests given in Chap. IX lists several examples of both.
Thus, Ballard's One-minute Reading and Arithmetic Tests and Burt's
Reading Fluency Test belong to the former type. Burt's Graded
Vocabulary Tests of Reading and Spelling and the Terman-Merrill
Intelligence Test belong to the latter. Most tests and examinations,
however, combine the two, i.e. they include easy and difficult items,
but also impose a time limit, so that the score depends on power and
rate of working.
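
The two methods of count scoring can be set side by side in a short sketch. The function names `rate_score` and `power_score` and the pupils' records are hypothetical; no published test is scored by this code.

```python
def rate_score(seconds_per_item, time_limit):
    """Rate method: count the equally easy items finished within the time limit."""
    elapsed, done = 0.0, 0
    for seconds in seconds_per_item:
        elapsed += seconds
        if elapsed > time_limit:
            break
        done += 1
    return done

def power_score(passed):
    """Power method: count the items passed on an untimed test of graded difficulty."""
    return sum(1 for ok in passed if ok)

# Hypothetical pupil: seconds taken on each item of a one-minute rate test.
print(rate_score([5, 6, 4, 7, 5, 6, 8, 9, 7, 6], time_limit=60))
# Hypothetical pupil: passes and failures on items ordered from easy to hard.
print(power_score([True, True, True, True, False, True, False, False]))
```

Both scores are simple enumerations; they differ only in whether the limit is set by the clock or by the difficulty of the later items.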

Comparison of Mental Count Scores with Physical Measures.—We must ask
now whether such count scores can legitimately be regarded as measures
of mental abilities, analogous, say, to measures of height and weight
or of physical abilities such as rate of running, power of muscular
contraction, and the like. Actually the numbers in terms of which
abilities are expressed show several important differences from
numbers such as inches on a ruler or pounds on a weighing machine. The
latter, physical, units are equal at any point on the scale. E.g. the
difference between 50 and 49 inches is precisely the same as the
difference between 40 and 39 inches. Moreover, they always have a zero
point: e.g. 0 lb. means no weight at all. Mental ability units often
lack these features, and this fact precludes us from adding,
averaging, or otherwise treating them as if they were true numbers.

For example, a one-year-old child might fail to pass any of the
Terman-Merrill Intelligence Tests, but his zero score shows, not that
he possesses zero intelligence, but that the tests are an inadequate
measuring instrument for this purpose. Again, a child with an I.Q. of
120 cannot be considered twice as intelligent as one with an I.Q. of
60, because we do not know what is the zero point of the I.Q. scale.
Even in a rate test of, say, reading speed, a child who reads 50
simple words in a minute, and so scores twice as much as one who reads
25 words in the same time, cannot properly be said to be twice as good
at reading.

It is obvious that the units in a rate test, such as words read, can
never be precisely uniform in difficulty. Similarly, the various
errors in a dictation test may vary considerably in seriousness. It
should be noted also that two children who obtain the same score on a
rate test do not necessarily accomplish precisely the same items,
since one may fail on, or omit, certain items which the other passes;
and they may each make the same number of mistakes in the spelling of
a certain passage, many of which are different. Our tests, then, might
be compared to a weighing machine with a large number of pound
weights, some of which are slightly heavier, some slightly lighter,
than a pound. When using the machine to weigh children we do not
always pick out precisely the same set of weights. Now, in spite of
these defects in the weights, we can claim that two children, both
found to be 70 lb., are very nearly the same weight, provided that the
irregularities among the pounds are small. And if four children are
found to weigh 90, 80, 70, and 60 lb., we can claim that the
difference between the first two is likely to be nearly equivalent to
the difference between the last two. Thus this analogy indicates that
when the number of items in a mental test is large, and irregularities
are small, the mental units may be regarded as approximately
equivalent.
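
The analogy of the imperfect weighing machine lends itself to a small simulation. The size of the irregularities (at most 0.05 lb. in each nominal pound) and the stock of 200 weights are assumptions chosen for illustration, as is the helper name `true_weight`.

```python
import random

random.seed(7)
MAX_ERROR = 0.05  # assumed: each nominal pound weight is wrong by at most 0.05 lb.
weights = [1 + random.uniform(-MAX_ERROR, MAX_ERROR) for _ in range(200)]

def true_weight(nominal_pounds):
    """Actual load balanced by a random selection of `nominal_pounds` weights."""
    return sum(random.sample(weights, nominal_pounds))

a, b = true_weight(70), true_weight(70)
print(f"two children both read as 70 lb.: {a:.2f} and {b:.2f}")

gap_top = true_weight(90) - true_weight(80)
gap_bottom = true_weight(70) - true_weight(60)
print(f"two nominal differences of 10 lb.: {gap_top:.2f} and {gap_bottom:.2f}")
```

Because the small errors are as often positive as negative, they largely cancel when many weights are used, which is the sense in which a long test with small irregularities yields approximately equivalent units.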
In tests of the pure power, or combined rate and power, types, the
items are not intended to be equal in difficulty. Instead, each item
should exceed the previous one in difficulty by approximately the same
amount. For example, each test in the Terman-Merrill Scale (from Year
VI to Average Adult level) is supposed to represent an increment of
two months of mental age.1 But it is generally admitted nowadays that
mental age units are not equivalent, and that the difference between a
child of M.A. 12 and one of 13 is probably much smaller than the
difference between children of 6½ and 6. Here, then, it is as if we
ran out of (nominal) pound weights after a certain point, and had to
use further weights which progressively increased in heaviness.
Obviously this leads to grave difficulties in the mental age method of
scoring (cf. Chap. IV). Several group intelligence tests, however, do
manage to maintain equal increments throughout.

The units of measurement of many physical skills present similar
problems. That equal increments of score do not necessarily represent
equal increments of skill can be seen from the example of golf scores.
It is decidedly easier for a player who requires 140 strokes for a
round to improve to 130, than it is for a player who takes 90 to
improve to 80 (or even—if we could assume golf scores to follow a
geometrical rather than an arithmetical progression—to 85). Here,
also, it is impossible to establish a zero point of ability, or to
calculate ratios between abilities.

Special techniques have been devised by Thorndike, Thurstone, and
others for the production of measuring scales with equal units, and
for determining the zero points on such scales. They are too complex
for description here. But they depend in the main upon the principle
of normal distribution of objective measures, which was enunciated in
Chap. I. Well-constructed tests always tend to yield a normal
frequency distribution; hence we are entitled to regard departures
from normality as indicative of bad construction or bad scoring. For
example, a mental test which contains plenty of easy and moderately
difficult items, but whose later items increase too rapidly in
difficulty to allow sufficient headroom, will give a negatively skewed
distribution. The analogy with a weighing machine would be as follows:
suppose that the available weights beyond the seventieth became
progressively bigger than 1 lb., and that a large group of children
were measured whose true weights averaged about 70 lb., and ranged
from 50 to 90 lb. The distribution would then be distorted from normal
to negative skew type, as in Fig. 15. Conversely, a test which
contains insufficient easy items, so that it discriminates effectively
only among the better pupils, is likely to show a positively skewed
distribution.

1 For a definition and discussion of mental age, cf. pp. 69-73.

Fig. 15.—Normal distribution (a), altered to negative skew curve (b),
by increasing the size of the higher units.

Application to School Marks and Examinations.—It is difficult to apply
this principle to school marks, since the groups are usually so small
(less than 50) that great irregularities in distribution are sure to
arise. The following table, however, gives the marks obtained by two
12+ classes in an ordinary school on an objective arithmetic test.
Owing to the deficiency in difficult questions, many pupils at the top
got full marks (80), and the real extent of their ability was not
tested at all.
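
The effect of insufficient headroom can be imitated with invented figures. In the sketch below the "true" attainments are simulated as normally distributed about 65 marks, and every pupil who would have earned more than the full 80 marks is held at 80; the numbers, the seed, and the use of the moment coefficient of skewness are all assumptions, and the data of Ex. 4 are not reproduced.

```python
import random

def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(3)
FULL_MARKS = 80
# Assumed "true" attainment of 2,000 pupils, normally distributed about 65 marks.
true_attainment = [random.gauss(65, 12) for _ in range(2000)]
# The test has too few difficult questions, so nobody can score above full marks.
observed = [min(FULL_MARKS, max(0, round(t))) for t in true_attainment]

print("pupils with full marks:", observed.count(FULL_MARKS))
print("skewness of true attainment:", round(skewness(true_attainment), 2))
print("skewness of observed marks:", round(skewness(observed), 2))
```

The pile of pupils at full marks shortens the upper tail, so that the observed marks bunch at the top and the coefficient becomes negative although the underlying attainments are symmetrical.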
Ex. 4.—Abnormal frequency distribution of arithmetic marks.

Further problems occur in the interpretation of count scores.
Physical measures, such as 60 inches or 15 seconds, always represent
definite amounts. No doubt exists as to their largeness or smallness,
because an inch or a second has an absolute, standard size. Laymen and
teachers often wrongly suppose that the same is true of mental
measures, such as the mark awarded to a pupil's work. They talk of
15/20 or 60% as if these numbers implied the same level of achievement
whenever, or by whomsoever, they are awarded. Yet it is surely obvious
that such scores, by themselves, tell us nothing at all about the
goodness or poorness of achievement. Such statements as: "Johnny is
doing well at arithmetic. He got 8 sums out of 10 correct" or "His
dictation is terrible; he made 15 mistakes"—are quite meaningless
unless we also know the difficulty of the sums and of the passage
dictated. Johnny may be the poorest arithmetician in his class, and
yet score 8/10 if sufficiently simple sums are set. Or he may be the
best speller in his class, and yet make 15 mistakes if the passage
happens to be taken, say, from a textbook of organic chemistry. The
speaker is, of course, implying that the tasks are of a level of
difficulty such that most of the class score less than 8 and 15. In
other words, the marks possess only a relative, not an absolute,
significance. Again, when an inspector or head teacher complains that
the average examination mark of a class is 'too low,' he is presumably
comparing the mark and the difficulty of the examination with
subjective recollections of the achievements of other similar classes.
Undoubtedly there is a great deal of loose thinking about these
matters, and much can be learnt from the more systematic procedure of
the scientific mental tester.

The tester realizes that the goodness or poorness of a child's test
performance can be interpreted only by comparing it with the
performances of other similar children. Hence he applies his test to
large numbers of children, similar as to age and other relevant
circumstances, scores their performances all in the same way, and can
then state definitely whether a certain performance is superior or
inferior, and how much better or poorer it is than the average. This
procedure is called the establishment of norms or standards for the
test. A discussion of the main types of test norms is given in Chap.
IV, where it is also shown how the adoption of analogous methods may
greatly help in the interpretation of school and examination marks.
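
How a norm group gives meaning to a raw score may be sketched as follows. The two norm groups are invented, the function name `interpret` is hypothetical, and the two indices returned (percentage of the group surpassed, and distance from the average in standard deviations) are merely convenient illustrations rather than the types of norm treated in Chap. IV.

```python
from statistics import mean, pstdev

def interpret(score, norm_scores):
    """Place one raw score relative to a norm group of similar children."""
    below = sum(1 for s in norm_scores if s < score)
    percent_below = 100.0 * below / len(norm_scores)
    deviation = (score - mean(norm_scores)) / pstdev(norm_scores)
    return percent_below, deviation

# Hypothetical norm group (sums correct out of 10) on a set of very easy sums.
easy_norms = [8, 9, 9, 10, 10, 10, 9, 8, 10, 9, 7, 10]
# Hypothetical norm group on a set of hard sums.
hard_norms = [3, 4, 5, 2, 6, 4, 5, 3, 7, 4, 5, 6]

for label, norms in (("easy sums", easy_norms), ("hard sums", hard_norms)):
    pct, dev = interpret(8, norms)
    print(f"8/10 on the {label}: above {pct:.0f}% of the group, "
          f"{dev:+.2f} standard deviations from the average")
```

The same mark of 8 out of 10 falls near the bottom of the first group and at the top of the second, which is the point of Johnny's case above.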


B. RANKING

Ranking means listing a group of pupils or students in order of their
attainment, 1st, 2nd, etc., down to bottom. This method may be applied
to any form of scholastic work, including composition, handwriting,
and the like, which are scarcely amenable to count scoring. It is
actually the soundest of all forms of marking, in spite of its many
limitations, because these limitations are so obvious and so easily
recognized. The first drawback is that the number of persons who can
be ranked is rather small. Even the arrangement of twenty in order
involves a good deal of uncertainty, especially near the middle of the
list. Those at the extremes are usually more readily discriminated
from one another. Next, it is clear that these ranks do not provide a
true numerical scale, since the numbers 1, 2, 3, etc., are by no means
equally spaced. Usually the differences between the attainments of
Nos. 1 and 2, 2 and 3, 18 and 19, 19 and 20 (out of 20) are greater
than the differences between Nos. 9 and 10, 10 and 11—hence the
difficulty, just mentioned, in ranking the medium pupils. The
frequency distribution is rectangular (Fig. 16), and is entirely
unlike the normal curve. These numbers, therefore, may give quite
false impressions if they are added, averaged, or otherwise
manipulated. For example, a pupil who is first in one examination and
third in another should be regarded as better than, not as equal to, a
pupil who is second in both (cf. p. 59).

Fig. 16.—Rectangular frequency distribution given by a rank order.
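
The case of the two pupils can be worked through numerically. The sketch assumes a class of 20 and converts each rank to a normal deviate by way of its mid-rank percentile; this particular conversion is an assumption adopted for illustration and is not claimed to be the procedure given in the textbooks cited.

```python
from statistics import NormalDist

def rank_to_normal_score(rank, group_size):
    """Convert a rank (1 = top) to a normal deviate via its mid-rank percentile."""
    proportion_below = (group_size - rank + 0.5) / group_size
    return NormalDist().inv_cdf(proportion_below)

N = 20  # assumed class size
first_and_third = rank_to_normal_score(1, N) + rank_to_normal_score(3, N)
second_twice = 2 * rank_to_normal_score(2, N)

print("sum of raw ranks:", 1 + 3, "and", 2 + 2)
print("sum of normal scores:", round(first_and_third, 2), "and", round(second_twice, 2))
```

The raw ranks add up to the same total for both pupils, whereas the converted scores favour the pupil who was first and third, since the step from 1st to 2nd is wider than the step from 2nd to 3rd.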
A third main drawback is that these numbers are always relative to
the particular set of persons who have been ranked.1 Thirtieth place
is not an unequivocal number like 30° on a thermometer. For 30th out
of 32 is much poorer than 30th out of 50. Moreover, if any one person
higher on the list should happen to be absent, the 30th rank would
change to 29th. It is not even analogous to, say, 30 mistakes in a
dictation test, since such a test can be applied to large numbers of
children, and norms can be collected which will tell us how low is the
level of ability to which 30 mistakes corresponds. In contrast, a rank
position can never be so interpreted with reference to outside
standards.
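
The dependence of a rank on the size of the group is a matter of simple arithmetic, sketched here with a hypothetical helper `percent_ranked_below`.

```python
def percent_ranked_below(rank, group_size):
    """Percentage of the group placed lower than the given rank (1 = top)."""
    return 100.0 * (group_size - rank) / group_size

print(percent_ranked_below(30, 32))  # 30th out of 32
print(percent_ranked_below(30, 50))  # 30th out of 50
print(percent_ranked_below(29, 31))  # the same pupil when one higher pupil is absent
```

Thirtieth place surpasses only a sixteenth of a class of 32 but two-fifths of a class of 50, so the bare number 30 conveys nothing until the size of the group is stated.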


> 


1 The numbers can, however, be converted into a normally distributed
set of scores (cf. Garrett's or Guilford's books). And it is possible,
without any conversion, to correlate one rank order with another, e.g.
to compare a teacher's ranking of school work with the pupils' results
on a group intelligence test (cf. Chap. VI).

C. QUALITATIVE GRADING AND CLASSIFYING
Examination scripts, essays, etc., are often merely classified into
some three to ten grades which are distinguished by letters (A, A−,
B+, etc.; α, β, etc.) or by names (Credit, Pass, Fail; 1st class, 2nd
class, etc.; Excellent, Very Good, Good, etc.). This is a more
primitive type of grading than numerical evaluation (e.g. on a 0 to
10, or a percentage, scale), though it is of course often applied to
scripts which have already been awarded percentage marks. For
instance, examinees whose mark is 80% or more are informed that they
have obtained First Class or Credit, and so on.
Since these grades make no pretence of being numerical measures, we
need not cavil at the eccentricity of their frequency distributions.
They are, however, liable to another, still more serious, defect which
is not found among count scores or ranks, namely that different
teachers or examiners often employ widely different scales or
standards of grading. Although they are supposed to represent absolute
evaluations of achievement, they are actually found by experimental
investigation to involve such gross discrepancies that neither pupils,
parents, nor teachers can hope to interpret their significance
correctly. For example, Hartog and Rhodes (1935) arranged for 48
School Certificate French scripts to be graded independently by six
experienced examiners. One of the examiners awarded 46 credits, 2
passes, and 0 failures. Another gave 17 credits, 12 passes, and 19
failures. The remaining four were intermediate. Similarly, Boyd (1924)
had 26 essays by 11+ children graded by 271 teachers as either
Excellent, VG+, VG, G+, G, Moderate, or Unsatisfactory. Some of the
teachers awarded one or other of the two highest grades to as many as
15 of the essays, others to none of them. Some also awarded one or
other of the two lowest grades to 17 of the essays, others to none of
them. In view of such results, it is clear that there is no agreed
meaning as to the absolute amount of ability which these gradings
should signify. A mark of Excellent may represent a high level of
ability if it is awarded by a teacher who very seldom gives
Excellents. It may mean little better than average ability if it is
awarded by a teacher who habitually grades nearly half his papers as
Excellent. In other words, this type of marking is actually a relative
one, like ranking. A mark of G+, or B−, or Pass, etc., only achieves
any meaning if we know what proportions of candidates obtained better
or poorer marks, in just the same way as a rank position of 30th can
only be interpreted if we know how many candidates were ranked lower
than 30th.
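
The relative meaning of a grade can be brought out from the figures quoted from Hartog and Rhodes. In the sketch the first examiner's passes are obtained as the remainder of the 48 scripts, which is an inference; the labels "A" and "B" for the two examiners and the function name `percent_at_or_above` are hypothetical.

```python
SCRIPTS = 48
# Counts quoted above from Hartog and Rhodes (1935); examiner A's passes are
# inferred here as the remainder of the 48 scripts (an assumption).
examiner_a = {"Credit": 46, "Pass": SCRIPTS - 46 - 0, "Fail": 0}
examiner_b = {"Credit": 17, "Pass": 12, "Fail": 19}

def percent_at_or_above(grades, grade, order=("Credit", "Pass", "Fail")):
    """Percentage of scripts given this grade or a better one by one examiner."""
    better_or_equal = order[: order.index(grade) + 1]
    return 100.0 * sum(grades[g] for g in better_or_equal) / sum(grades.values())

for name, grades in (("A", examiner_a), ("B", examiner_b)):
    print("Examiner", name, "awards Credit to",
          round(percent_at_or_above(grades, "Credit"), 1), "% of scripts")
```

A Credit from the first examiner places a script merely among nearly all the candidates, whereas a Credit from the second places it in roughly the top third.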

Qualitative grading should then be regarded as a form of ranking
where, instead of there being as many different ranks as there are
pupils or examinees, there are only some 3 to 10 ranks, any one of
which may be awarded to several pupils whose work is of approximately
equal merit. Its grave defect is that any one script is not usually
compared directly with the scripts of a particular group of pupils
before receiving its grade; it is merely compared with the marker's
more or less vague recollections of the scripts which he has marked in
the past. The only possible rational meaning for Excellent or First
Class is that the script which receives this grade seems to him to
rank higher, or to be relatively better, than almost all the scripts
produced by similar pupils which he has come across. The other grades
are similarly referred to rough subjective standards which he has set
up during his previous marking experience. And since his experience
naturally differs from all other markers' experiences, so will his
standards differ, unless he has taken the trouble to discuss and
compare them with others, and to try to adjust them accordingly.

Finally, as in the case of rank orders, there is no possibility of
exact comparison of such grades with external standards or norms, and
they cannot be added or averaged in any satisfactory way.1


? > D. NUMERICAL EVALUATION 
= ae and 3 
E. MIXED NUMERICAL EVALUATION AND COUNT SCORING 
Numerical evaluation consists, nominally, in assigning absolute
numerical grades to pupils' abilities, school work, or examination
scripts, either on a 0-10, 0-20, percentage, or some other scale. It
exhibits all the ambiguities of the qualitative grading just described,
and possesses several additional defects of its own. In other words,
although it is the most commonly used, it is quite the worst type of
marking. Unfortunately, since it involves numbers, it is frequently
confused with the count scores obtained from objective marking
(Type A). Teachers assume that the label of, say, 15/20 which they
award to an English composition is equivalent to a score of 15
correct arithmetic sums out of 20. Often the two become inextricably
mixed. For instance, partial credits involving subjective judgments
of merit may be awarded to partially right sums. Or an examination
in history, science, languages, etc., may include, say, five questions
each of which is marked subjectively out of 20, the marks then being
added as if they were count scores to yield percentage totals. Or some
analytic plan of marking may be adopted whereby one mark is
counted for each correct fact or idea, and so many are added for
subjective impressions of style or general merit.


¹ When the group of pupils is a large one, their grades can be converted into
rational scores; cf. Dawson (1933), pp. 91-3.
Distributions of such marks are sometimes smooth and symmetrical,
but are often wildly irregular or skewed. In an examination where a
pass mark of, say, 50% has been fixed, it is common to find very
large numbers of 50's, 51's, and 52's, but no 45's to 49's. Some
teachers show peculiar idiosyncrasies, such as avoiding certain marks
altogether. For instance, they may award 9, 8, 6, 5, or 3 out of 10
very often, but scarcely ever give 10, 7, 4, 2, or 1. Two illustrations
may be given of distributions among students or pupils who also
took an examination which was objectively scored.


Ex. 5.—Distributions of (Subjective) Marks for Reading and
(Objective) Marks for Geography.


    Mark    Reading F    Geography F
     10        24             2
      9        25             3
      8        13             ?
      7         6            10
      6         ?             ?
      5         ?             9
      4         0             9
      3         0             7
      2         0            10
      1         0             4
      0         0             1
    Total      74            74


The reading marks in Ex. 5 are fairly typical of the distribution
adopted by teachers in evaluating this subject, or composition, or
handwriting. Note that this teacher was not, as she supposed,
marking out of 10, but out of 6, since she only used 6 different grades.
The fact that about one-third of the class get 10 implies that the
work of all these children was equally good, which is certainly not
true. The geography marks awarded by the same teacher were
derived from a new-type test paper, and so give a roughly normal
distribution. It is surely obvious that 5/10 for geography does not
have the same significance, nor represent the same achievement, as
5/10 for reading. In the one it means approximately the average for
the class, in the other it means right at the bottom of the class. But
the children themselves and their parents are likely to assume that
5/10 always means the same thing.

Fig. 17 compares the distributions of percentage marks obtained
by a large group of students on a new-type examination and on
an ordinary examination of the essay-type. The former, though
somewhat irregular, tends towards the normal, symmetrical shape,
and includes marks ranging from 19% to 86%. The latter is skewed,
and it ignores the bottom half of the available scale of marks, since
it ranges only from 50% to 90%.

Fig. 17.—Distributions of percentage marks from (a) an objective examination,
and (b) a subjectively marked examination.

Education authorities would surely distrust a school the
weights on whose weighing machines varied, say, from 3 lb. to 14 lb.
each. Yet the units implied by the distributions which many of their
teachers employ are probably quite as variable as this, unless some
scheme of rationalization is applied (cf. Chap. IV). It is notorious
also that some faculties in a university use quite different scales of
marks, or are much more lenient than other faculties.

It must be obvious that such marks are not true measurements.
60% does not represent 60/100 of anything whatsoever. True, it lies a
certain distance between 0 and 100, but as nobody can define what
is meant either by zero or perfect ability, that fact does not help
much. Like the qualitative grades of Type C, they are merely labels
whose precise size is fixed more or less vaguely by convention.
Naturally the convention varies in different institutions.

Numerical grading, like qualitative grading, actually consists of
ranking the scripts in relation to the marker's subjective past
experience. It differs merely in that number labels are substituted for
verbal ones, and that most numerical scales contain a larger number
of different labels than do qualitative scales. This latter feature leads
to still another common defect. When an essay or a single examination
answer has to be marked on a percentage basis, the implication
is that the marker can distinguish, and keep in mind, a hundred
different grades of ability (or 51 different grades if he confines himself
to a range of 40% to 90%). Actually it has been proved by experiment
that people cannot consistently discriminate more than about
10 grades, and some would say only 5.

THE REFORMATION OF MARKING

The would-be reformer of marks meets with many objections and
criticisms. Most of them either rest upon ignorance of elementary
statistical principles, or else 'boil down' to the view that the systems
in vogue at present are firmly established, generally understood by
all concerned, and work too well and smoothly to be upset. While
such a reactionary attitude is entirely unwarranted, yet there are
admittedly many occasions in school marking where the rationality
of the scales employed matters little, or where an eccentric
distribution may be an actual advantage.

Some important examinations do not aim at measuring the
abilities of all the pupils, but purely at selecting a certain proportion
of them. For instance, the English secondary school selection
examination attempts to pick out the best 20% or so of candidates for
grammar school education. The corresponding Scottish examination
has roughly the opposite purpose, namely to eliminate the poorest
pupils who are unfit for advanced secondary school work. Mowat
(1938) points out that the most appropriate examinations or tests
in these two instances will be ones which will spread out, or
discriminate between, candidates whose marks are close to the respective
borderlines. The former then should yield a positively skewed
distribution and give plenty of headroom, while the latter should
yield a negatively skewed distribution and give plenty of footroom.
The marks of the bottom three-quarters of candidates in the former,
and of the top three-quarters in the latter, scarcely matter. Hence,
it might be a waste of time to construct an examination which would
spread them all out in the proper normal fashion.¹


¹ Nevertheless, attainments tests which cover the whole range are commonly
employed, partly because the position of the borderline varies widely from one
area to another, and partly because it is useful to have accurate educational
quotients to assist in allocating pupils to appropriate streams in any secondary
school they ultimately attend.
Next we can distinguish examinations which are constantly set in
schools for such pedagogic purposes as ensuring that all, or
practically all, the pupils have successfully accomplished a certain stage
in their work, or that they have learnt, say, some French grammar,
chemical formulae, or history notes, set to them for preparation.
Other exercises are needed, notably in mathematical subjects, for
practice in the operations which the pupils are in process of acquiring.
In these instances we may expect distributions to show a strong
negative skew. The tests will naturally be designed so that the
majority can get full or nearly full marks. The weaklings will be somewhat
more spread out, but even the poorest may be able to attain half
marks. Doubtless it is because these abnormal distributions have
become habitual in classwork that most teachers are so blind to the
defects in their examination marking. Such tests which are used as
pedagogic aids should be kept quite distinct from tests or examinations
designed for purposes of measuring abilities. When the object
is to compare different pupils or to find how able a pupil may be in
different subjects, then they should certainly be constructed and
marked according to the principles outlined below. Further, when a
standardized (published) test of intelligence or attainment is to be
given to a class or to a group of students, it should always be chosen
so as to yield a proper frequency distribution (cf. pp. 188-9).
Rationalization of Marking.—Marking will probably never become
scientific so long as it is confused with value judgments. Thus the
first point to realize is that marking is a process of differentiating
between the various levels of ability existent among the scripts which
are being marked. Either it is a process of arranging them in rank
order or of assigning them to one of a limited number of qualitative
or numerical grades. But it is not a process of deciding whether all,
or some, of the scripts are 'good' or 'bad,' whether the pupils 'ought
to have done better' or 'have tried hard,' whether they 'pass' or 'fail.'
All such decisions should be postponed until the actual marking is
completed.
Often it is desirable that the examination should be of the new- or
objective-type. The reasons for this advice appear in a later chapter,
but one of the main ones is that marking is greatly simplified and
rendered far fairer. In such examinations difficulties arise, not in the
scoring (which is purely of the count type), but in the setting. They
should be of such a length, and contain questions of sufficient ease
and difficulty, to differentiate among both the poorest and the best
pupils, and to yield an approximately normal distribution. Ideally
the average pupil in the class, or the average examinee, should get as
near as possible to half marks, the poorest pupil little better than
zero, and the best pupil little less than full marks. For if no examinee
scores less than, say, 40%, then it is obvious that the easy questions
at the beginning are a waste of everybody's time. And if no one scores
more than, say, 70%, the most difficult questions at the end are also
wasted.
Humanitarian Objections.—An obvious, and quite pardonable,
objection will be raised against this ideal, namely, that if marks range
from nearly 0% to nearly 100%, the poorer pupils will become
greatly discouraged when they habitually obtain marks of around
10% to 20%, and their parents will become bitterly resentful. Some
teachers are so conscious of the stimulating effects of success and
the depressing effects of failure, that they deliberately distort marks,
giving the duller pupils who have tried hard more than their work
deserves, and the brighter ones who have not done so well as they
might less than they deserve. It may be that this kindheartedness
of teachers towards the dullards has been partly responsible for
forcing up their distributions of marks into the absurd shapes
typified by Ex. 5 above. Perhaps equally potent has been their fear
of the criticisms of head teachers and inspectors if they admitted
their failure to teach the almost ineducable pupils by marking their
work on its true merits. Worthy as such humanitarian objections
may be, they do not constitute an excuse for bad test construction or
bad marking. But the solution is not difficult. We shall see in Chap.
IV that the actual average, and range, or the scale of marks,
employed do not matter in the slightest, provided that everyone agrees
to use the same scale. Either, then, the test may contain sufficient
(useless) questions which are easy enough to encourage the poor
pupils; or the test may be rationally constructed, as indicated above,
so as to yield a range of, say, 4 to 26 out of 30, or 10 to 70 out of
80; and when the marking is complete the marks can readily be
converted into some more acceptable and conventional scale, such
as 40% to 90%, by the methods described in Chap. IV. A third
alternative is to withhold all numerical grades from pupils and
parents, since they are so liable to misinterpretation, and instead,
once the scientific marking is done and is recorded for administrative
purposes, to reconsider the scripts and attach any verbal or letter
labels to them that appear suitable for pupil and parental consumption.
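
The conversion itself is deferred to Chap. IV. Purely as an illustration, and assuming a simple straight-line conversion (an assumption of this sketch, not necessarily the method given in Chap. IV), the range of 4 to 26 mentioned above could be carried onto 40% to 90% as follows. The function name `rescale` is hypothetical.

```python
def rescale(mark, raw_low, raw_high, new_low, new_high):
    """Map a raw mark linearly from [raw_low, raw_high] onto [new_low, new_high].

    Assumption: a straight-line conversion, chosen only for illustration.
    """
    fraction = (mark - raw_low) / (raw_high - raw_low)
    return new_low + fraction * (new_high - new_low)


# Raw marks of 4, 15 and 26 (out of 30) carried onto a 40%-90% scale.
for raw in (4, 15, 26):
    print(raw, "->", rescale(raw, 4, 26, 40, 90))
```

The order of the pupils and the relative gaps between them are unchanged by such a conversion; only the labels attached to the marks differ.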
Grading of Complex Educational Products.—Certain types of work
such as English composition and handwriting usually have to be
graded on the basis of total subjective impression. The same is true
of many examination scripts when the answers consist of a series of
historical, scientific, or other essays. Here all the answers to any one
question should invariably be marked before the examiner proceeds
to another question. Whenever there are about twenty or more such
essays (or other educational products) to be dealt with, resort should
be made to the quality, or product scale method (cf. p. 154). The
marker should first skim through a fair number of the scripts until
he has gained a rough impression of their average level and of their
extreme range of attainment. No account whatever should be taken
of his feelings that this level is laudably good or deplorably bad.
What is needed is the actual average of these particular scripts.
Next he should look through them more carefully and pick out one
which typifies the average, set it aside, and call it 5. Next the two
should be taken which appear to be the best and worst of the bunch;
these should be called either 9 and 1, or 8 and 2. Finally, others
should be selected which seem to represent levels intermediate
between 9 (or 8) and 5, and between 5 and 1 (or 2). These should be
numbered (8), 7, 6, 4, 3, (2). A good deal of revision of first
impressions may be required before a final decision is reached as to
these nine (or seven) samples. Once they are chosen the marking is
straightforward. The rest of the scripts are simply compared one by
one with the samples and each is awarded the mark of the sample
which it most closely resembles in merit. Should one or two turn up
which are superior to No. 9 or inferior to No. 1, they may be given
10 or 0.

One obvious advantage of this plan is that it enables the marker
to be self-consistent and to maintain the same standards throughout.
Another is that it does not demand impossibly fine discriminations.
The number of samples can be reduced to five, or extended to eleven,
if desired, or some scripts may be given a mark midway between two
samples, e.g. 6½. But usually seven or nine grades are sufficient.
Most important, however, is the fact that the final distribution will
be fairly smooth and symmetrical if the number of scripts is large
and the samples have been well chosen. Even if their number is small
there should be a majority of 6's, 5's, and 4's, and a minority of 9's,
8's, 2's, and 1's; and there should be about as many above as below
5. The precise marks given to the samples do not matter. They may,
for example, equally be called 90%, 85%, 80% . . . 55%, 50%,
so long as the marker resolutely avoids all temptations to withhold
90% on the irrelevant grounds that the best scripts in the bunch do
not seem to him to be 'up to 90% standard.' It is, however, probably
better to do the marking of each question on a 0-10 scale. Such
marks for several questions in an examination can legitimately be
added. The final totals can then be converted into any desired
conventional scale, just as can the scores from an objective test.
Other Types of Marking.—It is more difficult to give definite
recommendations as to the marking of work of mixed type, where
only part of the merit or demerit resides in points which can be
counted. In foreign language examinations, history, science, etc.,
both correct or incorrect facts, and general style, or originality and
the like, may need assessment. Sometimes these quantitative and
qualitative features can be graded separately and later combined in
due proportions. It might be better if, as is suggested in Chap. XIV,
such papers were not set; either they should be definitely new-type,
or definitely of the qualitative essay-type. Nevertheless, the main
requisites of good marking should always be attainable, namely a
similar distribution and similar average level of marks in all subjects,
or in all the questions set in any one subject, and an approximation
of this distribution to normality.

Still more difficult are the problems of the marker who has only a
few scripts with which to deal, e.g. a university external examiner
who is called in to assess half a dozen degree candidates. Even if the
papers consisted of perfectly constructed new-type tests, they would
inevitably yield extremely irregular and variable distributions. The
examiner is, therefore, forced to employ subjective recollections of
the merits of similar candidates whom he has marked in the past.
Nevertheless, he should keep the normal curve in mind, for in the long
run all the scripts that he meets are likely to be normally distributed.
He should try, therefore, to build up definite mental specifications
of his 9, 8, 7 . . . 2, 1 levels, and attempt, by thorough discussion
with the other examiners, to decide where on this scale the First
Class, Pass, or other standards fall.

Some additional points bearing on these plans, and the answers
to some of the possible criticisms, will be discussed when we have
studied, in the next chapter, the statistical conceptions of percentiles,
averages, and standard deviations.
STATISTICAL METHODS FOR HANDLING
FREQUENCY DISTRIBUTIONS
Numerical Description of Distributions.—So far we have dealt with
various types of distributions of measures almost wholly in verbal
terms, and have compared one distribution with another mainly 'by
eye.' More exact study of them necessitates quantitative description
and definition. We need first a numerical expression for the 'general
run' of the measures, and, secondly, some index of the range or
spread of measures. For instance, if we want to know how intelligent
Johnny Smith is, and ask a psychologist, he will tell us Johnny's
score on an intelligence test and, in order to enable us to interpret
this score, he will add that the normal score for children of Johnny's
age is so and so, and that Johnny's score falls at such and such a level
in the distribution. Again, if we ask a teacher what are the heights
of the children in her class, she might in reply show us a complete
table of the distribution of heights, or she might tell us with greater
precision that the average was 50 inches and the range from 42 to
57 inches.
Actually the range is not a good index of the way the measures
are spread out above and below the average, since it depends on only
two of the measures, those obtained by the extreme pupils who were
measured. For example, although this class contains children ranging
from 42 to 57 inches, all the rest might lie between 46 and 54 inches,
and we would hardly get a fair picture of the spread if we were only
told that the total range is 15 inches. A much better index is called
the Standard Deviation (often abbreviated to S.D. or σ), which will
be described below.
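
A small sketch may make the weakness of the range concrete. The two sets of heights below are hypothetical; both run from 42 to 57 inches, yet one is much more closely bunched about its average. For comparison the sketch uses Python's `statistics.pstdev`, taken here as a stand-in for the S.D. that the text describes later.

```python
from statistics import pstdev

# Hypothetical heights in inches; both classes run from 42 to 57.
bunched = [42, 46, 48, 49, 50, 50, 51, 52, 54, 57]
spread_out = [42, 43, 45, 47, 50, 50, 53, 55, 56, 57]


def total_range(values):
    """Total range: highest measure minus lowest measure."""
    return max(values) - min(values)


# The range is identical for the two classes ...
print(total_range(bunched), total_range(spread_out))
# ... but the standard deviation shows the second class is more dispersed.
print(round(pstdev(bunched), 2), round(pstdev(spread_out), 2))
```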
A distribution of measures can then generally be defined
numerically by means of its average and standard deviation. But we shall
describe first an alternative system which possesses certain practical
advantages. This is based on what are known as the median,
quartiles, and deciles or percentiles. These terms all refer to certain levels
or fixed points in a distribution. Suppose the measures to be
tabulated or arranged in order, then the median is the middle measure,
i.e. the measure half-way up counting from the bottom, or half-way
down counting from the top. The lower and upper quartiles, usually
referred to as $Q_1$ and $Q_3$, are the measures one-quarter and
three-quarters way up. The decile measures are one-tenth, two-tenths, etc.,
and the percentiles one-hundredth, two-hundredths, etc., of the way
up. Stated in a different way, the Xth percentile is the measure below
which X% of all the measures fall.


Determining the Median.—Ex. 6.—Consider the following distribution
of only five measures. Clearly, the median is 15, that is the
third measure up from the bottom or down from the top.

     ?
    17
    15
    13
     ?

From this example it may be seen that the median is the
$\frac{N+1}{2}$th measure, not the $\frac{N}{2}$th.

Ex. 7.—In the following distributions, N = 8, hence $\frac{N+1}{2} = 4\frac{1}{2}$.

     9    10     ?
     8     9    17
     8     8    14
     7     7    12
     7     6     ?
     ?     5     ?
     6     4     4
     5     3     1

Thus the median is half-way between the fourth and fifth measures
from the bottom. In the first distribution the fourth and fifth
measures are both 7, hence the median is 7. In the second, the fourth and
fifth measures are 6 and 7, hence we have to take 6½ as the median.
In the third distribution the data are too sparse for an accurate
median to be determined. The best we can do is to interpolate and
call it 10½.
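
The counting rule just described can be sketched in a few lines. This is only an illustration of the $\frac{N+1}{2}$ rule; the function name is hypothetical, the first list is the second distribution of Ex. 7, and the list of five measures is made up for the purpose.

```python
def median_by_position(measures):
    """Median as the (N + 1)/2-th measure, counting up from the bottom.

    When N is even the position ends in one-half, and the value half-way
    between the two neighbouring measures is taken.
    """
    ordered = sorted(measures)
    n = len(ordered)
    position = (n + 1) / 2                    # 1-based position from the bottom
    lower = ordered[int(position) - 1]        # measure at or just below it
    upper = ordered[int(position + 0.5) - 1]  # measure at or just above it
    return (lower + upper) / 2


# Second distribution of Ex. 7: N = 8, so the position is 4.5.
print(median_by_position([10, 9, 8, 7, 6, 5, 4, 3]))
# Hypothetical set of five measures: N = 5, so the position is 3.
print(median_by_position([21, 17, 15, 13, 8]))
```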


Determining the Quartiles and Percentiles.—The quartiles are
similarly obtained by counting the measures which are $\frac{N+1}{4}$ and
$\frac{3(N+1)}{4}$ up from the bottom. For the distribution given in Ex. 1 (p. 5),
N = 50, hence the required measures are 12¾ and 38¼ up the list.
The 11th, 12th, and 13th measures are 9, the 37th to 43rd measures
are all 17, hence $Q_1$ = 9, $Q_3$ = 17.
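
As a check on this counting, the sketch below locates the measures at positions $\frac{N+1}{4}$ and $\frac{3(N+1)}{4}$. The frequencies are this sketch's reading of the Ex. 1 distribution as it is tabulated again in Ex. 9; the function names, and the use of straight-line interpolation for fractional positions, are assumptions made for illustration.

```python
def measure_at(ordered, position):
    """Measure at a 1-based, possibly fractional, position from the bottom."""
    below = int(position)            # whole part of the position
    fraction = position - below
    low = ordered[below - 1]
    if fraction == 0:
        return low
    high = ordered[below]            # next measure up the list
    return low + fraction * (high - low)


def quartiles(measures):
    """Q1 and Q3 by the (N + 1)/4 rule; assumes at least three measures."""
    ordered = sorted(measures)
    n = len(ordered)
    return measure_at(ordered, (n + 1) / 4), measure_at(ordered, 3 * (n + 1) / 4)


# Mark -> frequency for the 50 measures of Ex. 1 (as read from Ex. 9).
freq = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
        10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1, 0: 0}
measures = [mark for mark, f in freq.items() for _ in range(f)]

print(quartiles(measures))
```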


Of the percentiles, the most useful are the 100th or 99th, 90th,
75th, 50th, 25th, 10th, 1st, or 0th. The 100th and 0th are usually
the top and bottom measures in the distribution, respectively. The
75th and 25th are the quartiles, the 50th is the median. The 90th
percentile or 9th decile is the measure which is $\frac{9(N+1)}{10}$ from the
bottom.

Determining Percentiles from Tabulated Data.—When dealing with
measures which have been grouped into classes, these various levels
are likely to fall somewhere in the middle of a class. For instance, in
Ex. 3 (p. 10), the median of 250 measures is the 125½th, which lies
somewhere within the class 98-102. Its position is found by
interpolation, and the following formula may be used.

$$X_p = l + \frac{C}{F}\left(\frac{PN}{100} - S\right)$$

P = the percentile.
$X_p$ = the measure falling at this percentile level, which we wish to
find.
l = the lower limit of the class, somewhere within which $X_p$ lies.
S = the sum of all the frequencies up to, but not including, this
class.
F = the frequency within this class.
N = the total of all the frequencies.
C = the size of the class.

Ex. 8.—To find the median of the distribution given in Ex. 3,
substitute in this formula:
P = 50. l, the lower limit, = 98. S = 3 + 10 + 18 + 27 + 43 = 101.
F = 47. N = 250. C = 5.

$$X_p = 98 + \frac{5}{47}\left(\frac{50 \times 250}{100} - 101\right) = 100.5$$

(to the first place of decimals).

By applying the same formula, we find the 99th, 90th, 75th, 25th,
10th, and 1st percentiles to be 128, 117, 109, 94, 86, and 77
respectively (omitting decimals).
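
The interpolation formula lends itself to a one-line function. The sketch below simply restates $X_p = l + \frac{C}{F}\left(\frac{PN}{100} - S\right)$ and substitutes the figures of Ex. 8; the function name is hypothetical.

```python
def percentile_from_classes(P, l, S, F, N, C):
    """Interpolated percentile for grouped data.

    P: percentile wanted; l: lower limit of the class containing it;
    S: sum of frequencies below that class; F: frequency within the class;
    N: total of all frequencies; C: size (width) of the class.
    """
    return l + (C / F) * (P * N / 100 - S)


# Figures of Ex. 8: the median (P = 50) lies in the class starting at 98.
median = percentile_from_classes(P=50, l=98, S=101, F=47, N=250, C=5)
print(round(median, 2))
```

The result is about 100.55, in agreement with the value of 100.5 quoted in Ex. 8.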


Comparison of Median and Average.—If the distribution is a
perfectly smooth and normal one, the median is identical with the
average. In any fairly regular and symmetrical distribution they will
be quite close to one another. Thus in Ex. 8, where the median is
100.5, the average happens to be 100.68. For rough work the median
is often used as an approximation to the average, since its
computation is far easier. Sometimes, indeed, it gives a better notion of the
'general run' of the measures than does the average. For example,
the average age of a school class may be unduly raised because the
class includes two or three pupils much older than all the rest. These
cannot very well be omitted in calculating the average, but if the
median is found, they no longer exert a disproportionate effect upon
the result. Similarly, if a free word association test is applied to a
child at a psychological clinic, most of the child's responses may be
given within two to three seconds, but some of them may take much
longer, and a few words may evoke no response at all. In such a case
the average cannot be determined, and the median is the most
reasonable measure of the child's speed of response.

Here is another instance where the median may be very useful.
Suppose an investigator wishes to find the approximate average
speed with which a group of pupils or students can perform a certain
task, such as handwriting. There is no need for him to time each
person separately, nor to count the number of words each one has
written in a certain time, and then add up and average the results.
Instead, he may set the whole class writing a standard passage, and
tell them to hold up their hands directly they finish. Then he may
count the hands raised, and stop the test as soon as the median
person raises his hand, noting the time at which this occurs. The
median speed of writing for a class of, say, 41 persons can be
calculated from the time taken by the 21st quickest person.

When a distribution is strongly skewed, the median differs
appreciably from the average, being larger if the skew is negative,
smaller if it is positive. In Ex. 1 (p. 14), the median is 14, the average
12.86. This difference is often used as a measure of skewness (the
formula can be found in more advanced textbooks).
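
Two of the cases discussed in this section can be tried out directly. The figures below are hypothetical and serve only to illustrate the argument: a class with a few much older pupils, and a set of response times in which one word evoked no response at all.

```python
from statistics import mean, median

# Hypothetical ages (years): most pupils are 11 to 13, three are much older.
ages = [11, 11, 11, 12, 12, 12, 12, 12, 13, 16, 17, 17]
print("average age:", mean(ages), "median age:", median(ages))

# Hypothetical response times in seconds; None marks a word with no response.
times = [2.1, 2.4, 2.5, 2.8, 3.0, 6.5, None]
# A missing response cannot be averaged, but it can still be ranked as
# slower than every timed response, which is all the median requires.
ranked = [float("inf") if t is None else t for t in times]
print("median response time:", median(ranked))
```

The older pupils pull the average above the median, while the median response time is unaffected by how the missing response is valued, so long as it is ranked last.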

Uses of Percentiles.—The percentile levels in a distribution clearly
provide us with as complete a numerical description of that
distribution as we may need.¹ For example, the figures quoted in Ex. 8 for the
distribution of students' scores on the Cattell Intelligence Test (the
99th, 90th, 75th, 50th, 25th, 10th, and 1st percentiles) give a
reasonably full answer to the question: how did the students do on the
test? Knowing these levels, we can begin to interpret the significance
of any one person's score, since we can tell how he stands relative to
this group of students. Thus a score of 92 on the Cattell test would
not be a very good one for a training college student. It falls at about
the 20th percentile, which means that 80% of the students did better
at the test, only 20% did worse. But when we are told that the same
score falls at the 94th percentile for normal adults (not students), we
realize that it does represent quite high ability. The applications of
percentiles to the interpretation of school marks and to test norms
will be considered later.

¹ The main disadvantage of medians, and also of percentiles, is that they cannot
readily be manipulated statistically. For instance, given the medians of
distributions A and B, we cannot determine from them the median of the combined
distribution, A + B; whereas the averages of A and of B will give us the average
of A + B.
Inter-quartile Range.—The range of scores from the 25th to the
75th percentiles ($Q_1$ to $Q_3$) is often called the inter-quartile range.
It tells us what the middle 50% of the group accomplished. Half of
this range is often referred to as Q, the semi-interquartile range, and
this is used as a measure of the degree of spread or dispersion of the
distribution. When the distribution is smooth and normal, Q is
usually equal to about one-eighth of the total range of measures.
In Ex. 3 (p. 10) the total range is 61, and $Q = \frac{109 - 94}{2} = 7.5$,
which is one-eighth of 60.
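
The arithmetic for Q is short enough to verify at once. The sketch below assumes the 25th and 75th percentiles of 94 and 109 quoted for this distribution in Ex. 8; the function name is hypothetical.

```python
def semi_interquartile_range(q1, q3):
    """Q: half the distance between the lower and upper quartiles."""
    return (q3 - q1) / 2


# 25th and 75th percentiles of the Ex. 3 distribution, as quoted in Ex. 8.
q = semi_interquartile_range(94, 109)
print(q, 60 / 8)
```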


CALCULATING THE AVERAGE OR MEAN

More accurate statistical treatment of distributions demands the
use of the average and standard deviation in place of the median
and Q. Everyone knows how to calculate an average or arithmetic
mean (often referred to as 'the mean'), but most people use an
unnecessarily complex method, which is very liable to lead to
mistakes. We will therefore outline this and the shorter alternative
methods.

(1) The ordinary method consists in adding up all the measures
and dividing by the number of cases. Expressed as an algebraic
formula, $M = \frac{\Sigma X}{N}$, where M = the mean or average, X = any one
measure, and Σ = the sum of all the ..., hence ΣX signifies the sum
of all the separate measures. As a rough general rule, this method
may be used when N does not amount to more than about 20. With
larger numbers it becomes very inefficient, unless an adding machine
is available.
(2) Often it is quicker to tabulate the measures, to multiply each
X by its F, and then to sum. The formula now becomes
$M = \frac{\Sigma(FX)}{N}$.

Ex. 9.—The method is here applied to the distribution already
given in Ex. 1.

     X     F     FX
    20     2     40
    19     2     38
    18     3     54
    17     7    119
    16     4     64
    15     6     90
    14     3     42
    13     4     52
    12     2     24
    11     2     22
    10     2     20
     9     3     27
     8     2     16
     7     1      7
     6     2     12
     5     1      5
     4     2      8
     3     0      0
     2     1      2
     1     1      1
     0     0      0
       N = 50   Σ(FX) = 643

$M = \frac{643}{50} = 12.86$
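
Method (2) amounts to a weighted sum. The sketch below applies $M = \frac{\Sigma(FX)}{N}$ to the frequencies of Ex. 9 as this sketch reads them from the table (they agree with N = 50 and with the mean of 12.86 quoted in the text); the variable names are hypothetical.

```python
# Mark X -> frequency F for the distribution of Ex. 9.
freq = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
        10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1, 0: 0}

n = sum(freq.values())                                 # N, the number of cases
sum_fx = sum(x * f for x, f in freq.items())           # sum of each X times its F
mean_value = sum_fx / n

print(n, sum_fx, mean_value)
```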
(3) The process is shortened still further if the measures are
tabulated in classes. If x is the midpoint of each class, then
$M = \frac{\Sigma(Fx)}{N}$.

Ex. 10.—The method is here applied to the distribution given in
Ex. 9.

      X       x      F     Fx
    18-20    19      7    133
    15-17    16     17    272
    12-14    13      9    117
     9-11    10      7     70
     6-8      7      5     35
     3-5      4      3     12
     0-2      1      2      2
                N = 50   Σ(Fx) = 641

$M = \frac{641}{50} = 12.82$

Note that this value of M, 12.82, is not precisely identical with
that obtained by the previous method, 12.86, though the discrepancy
is only 0.3%. The reason is, of course, that the figures are somewhat
distorted by being grouped into classes. We are regarding the 20's
and the 18's as 19's, the 17's and 15's as 16's, and so on. In the long
run these distortions tend to cancel one another out, and therefore
have very little effect upon the final result. They only become serious
if the grouping is too coarse, i.e. if the number of classes in the table
is too small.
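
The grouped calculation of Ex. 10, and the size of the discrepancy it introduces, can be reproduced as follows. The class limits and frequencies are those of the Ex. 10 table; the variable names are hypothetical, and 12.86 is the ungrouped mean quoted in the text.

```python
# (lowest mark, highest mark, frequency F) for each class of Ex. 10.
classes = [(18, 20, 7), (15, 17, 17), (12, 14, 9), (9, 11, 7),
           (6, 8, 5), (3, 5, 3), (0, 2, 2)]

n = sum(f for _, _, f in classes)
# Every measure in a class is treated as if it lay at the class midpoint x.
sum_fx = sum(f * (low + high) / 2 for low, high, f in classes)
grouped_mean = sum_fx / n

ungrouped_mean = 12.86   # value obtained by method (2)
discrepancy = abs(grouped_mean - ungrouped_mean) / ungrouped_mean * 100

print(n, sum_fx, grouped_mean, round(discrepancy, 1))
```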

(4) The amount of computation, when large numbers are involved,
may be much reduced by applying the device of the arbitrary or
guessed mean. The following illustration may explain it. Suppose we
wish to find the average height of a class of children. We could first
of all measure each child's height from the ground to the top of his
head, add up the figures and divide by N. An alternative plan, which
would do equally well, would be to place each child by a 3-foot table,
and measure the height from this to the top of his head. We would
average the measures as before, but would add 36 inches to our
final result. Again, we might take a still higher table, say 50 inches,
and measure from this up to the tops of the heads of the larger
children, down to the heads of the smaller ones. If we added together
the former, subtracted the latter, divided by N, and finally added
50 inches, we should, of course, get exactly the same result as before.
But the figures with which we should be dealing would be far smaller
than in the first instance. We would be averaging, e.g. +6, +1,
-2, +3, -5, etc., instead of 56, 51, 48, 53, 45, etc.

The method consists, then, in making a rough guess at the mean,
listing the amounts by which each measure deviates from (is greater
or less than) this arbitrary mean, averaging the deviations, and
finally adding the average to the arbitrary mean.

$$M = \frac{\sum D}{N} + A$$

A is the arbitrary or guessed mean, and D the difference between
each X and A.

Ex. 11. The method is here applied to the distribution given in
Ex. 9 (p. 41), and it yields precisely the same result for M as was
obtained by the ordinary method of averaging.

  X      D      F     FD
 20     +7      2     14
 19     +6      2     12
 18     +5      3     15
 17     +4      7     28
 16     +3      4     12
 15     +2      6     12
 14     +1      3      3
 13      0      4           (+96)
 12     -1      2      2
 11     -2      2      4
 10     -3      2      6
  9     -4      3     12
  8     -5      2     10
  7     -6      1      6
  6     -7      2     14
  5     -8      1      8
  4     -9      2     18
  3    -10      0      0
  2    -11      1     11
  1    -12      1     12
  0    -13      0      0
            N = 50          (-103)

A = 13.

$$M = \frac{\sum (FD)}{N} + A = \frac{96 - 103}{50} + 13 = -0.14 + 13 = 12.86$$

Note that in this example Σ(FD)/N is a minus quantity, and is
therefore subtracted from A. If we had chosen A = 12, Σ(FD)/N
would have been a plus quantity, namely +0.86.
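The device of the guessed mean can be sketched in a few lines of Python. This is only an illustration, not part of the original working: the function name `mean_from_guess` is an assumption, and the five heights are those quoted in the table illustration (56, 51, 48, 53, 45 inches).

```python
def mean_from_guess(measures, guess):
    """Average the deviations from a guessed mean, then add the guess back."""
    deviations = [x - guess for x in measures]   # D = X - A
    return guess + sum(deviations) / len(deviations)

heights = [56, 51, 48, 53, 45]                   # inches, as in the illustration

from_50_inch_table = mean_from_guess(heights, 50)    # deviations +6, +1, -2, +3, -5
from_36_inch_table = mean_from_guess(heights, 36)    # the 3-foot table
ordinary_mean = sum(heights) / len(heights)

print(from_50_inch_table, from_36_inch_table, ordinary_mean)
```

Whatever value is guessed, the result agrees with the ordinary average; a guess near the true mean merely keeps the figures small.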

(5) By far the most efficient method when N is large, say 100 or
more, is to average, not the actual amounts, but the numbers of
classes by which the tabulated measures exceed or fall short of A.
For instance, suppose we have a table of the distribution of ages of
all the inhabitants of a town, grouped into classes of ten years, with
midpoints of the classes at 5, 15, 25, etc., years. Wishing to calculate
the average age we take 35 as our arbitrary mean. Obviously there is
no need to sum the numbers who are 10, 20, 30, ... years more or less
than this A. We might just as well sum those who are 1, 2, 3, ...
decades more or less. Then, when we have calculated the average
deviation from A in decades, we must multiply it by 10 to turn it
back into years, before adding it to, or subtracting it from, A.

The formula for this method is

$$M = \frac{\sum (Fd)}{N} \times c + A,$$

where d is the number of the class above or below that class of which
A is the midpoint, and c is the size of each class.


Ex. 12. The method is here applied to the distribution given in
Ex. 3 (p. 10).

    X         d      F     Fd
 133-137     +6      1      6
 128-132     +5      1      5
 123-127     +4      6     24
 118-122     +3     11     33
 113-117     +2     22     44
 108-112     +1     26     26
 103-107      0     35           (+138)
  98-102     -1     47     47
  93-97      -2     43     86
  88-92      -3     27     81
  83-87      -4     18     72
  78-82      -5     10     50
  73-77      -6      3     18
                N = 250          (-354)

The arbitrary mean is taken as the midpoint of the 103-107 class,
i.e. as 105. Classes above and below this are numbered consecutively.

$$\frac{\sum (Fd)}{N} = \frac{+138 - 354}{250} = -0.864$$

This is the average number of classes by which M differs from A.
We multiply by 5 to get the average deviation in scores, since c = 5.

$$M = 105 - 5 \times 0.864 = 100.68$$

Once the student is accustomed to this method he will find it far
quicker and more accurate than any other. If in this example he had
used one of the first three methods, he would have been involved in
a sum totalling about 25,000. Points which require special caution
are: (i) making sure what measures are included in a class, and what
is the size of c; (ii) determining A correctly; (iii) multiplying Σ(Fd)/N
by c before adding it to A; (iv) making sure of the sign of this
quantity before adding it to, or subtracting it from, A.
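The class-step method may be sketched as follows, using the frequencies of Ex. 12. The function name `coded_mean` and the dictionary layout (class step d mapped to frequency F) are assumptions made for the illustration only.

```python
def coded_mean(freq_by_step, guess, class_size):
    """Mean from class steps: M = (sum(F*d) / N) * c + A."""
    n = sum(freq_by_step.values())                           # N
    sum_fd = sum(d * f for d, f in freq_by_step.items())     # sum of F*d, sign kept
    return guess + class_size * sum_fd / n

# Ex. 12: step d (classes above or below the 103-107 class) -> frequency F
ex12 = {6: 1, 5: 1, 4: 6, 3: 11, 2: 22, 1: 26, 0: 35,
        -1: 47, -2: 43, -3: 27, -4: 18, -5: 10, -6: 3}

mean_ex12 = coded_mean(ex12, guess=105, class_size=5)
print(round(mean_ex12, 2))
```

Keeping the sign of each step inside the sum takes care of cautions (iii) and (iv) automatically: the multiplication by c and the addition or subtraction follow from the arithmetic.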
THE MEAN VARIATION

The Mean Variation or Average Deviation (M.V. or A.D.) is
occasionally employed as an index of the extent to which the
measures are spread out on either side of the mean. As its name indicates,
it is the average of the differences between each measure and the
mean (or sometimes the median), regardless of whether these
differences are plus or minus.

Ex. 13. In this very brief distribution, the mean is 5.6. The D
column lists the deviations. M.V. = ΣD/N = 10.4/5 = 2.08. If the
data were in the form of a frequency distribution table, the same
method would apply, but the formula would be M.V. = Σ(FD)/N.

    X       D
    9      3.4
    7      1.4
    6      0.4
    4      1.6
    2      3.6
  N = 5   10.4
  M = 5.6
Clearly the M.V. will be larger the more widely dispersed or
spread out the measures. Its value is a little greater than that of Q,
being generally about one-seventh of the total range of measures.
(This does not hold in Ex. 13 owing to the shortness and irregularity
of the distribution.) The M.V. happens to be of very little use for
purposes of further statistical treatment, hence it is usually passed
over in favour of the Standard Deviation.
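As a check on the arithmetic of the Mean Variation, a minimal Python sketch is given here. The function name `mean_variation` is an assumption; the five measures are those of the short distribution in Ex. 13.

```python
def mean_variation(measures):
    """Average of the deviations from the mean, ignoring their signs."""
    mean = sum(measures) / len(measures)
    return sum(abs(x - mean) for x in measures) / len(measures)

short_distribution = [9, 7, 6, 4, 2]       # Ex. 13, mean 5.6
mv = mean_variation(short_distribution)
print(round(mv, 2))
```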
THE STANDARD DEVIATION

The S.D. is also a kind of average of all the deviations from the
mean, but it differs from the M.V. in that the deviations are squared
before they are summed, and the square root of their average is
taken.

Ex. 14. The working is here shown for the same short distribution
as in Ex. 13.

    X       D       D²
    9      3.4    11.56
    7      1.4     1.96
    6      0.4     0.16
    4      1.6     2.56
    2      3.6    12.96
  N = 5           29.20
  M = 5.6

σ² = ΣD²/N = 29.20/5 = 5.84
σ = 2.42


As the mean is seldom a whole number, the squaring of deviations
from it is troublesome. Generally it is simpler to list the deviations
from some arbitrary mean which is a whole number, and then to
apply a correction according to the following formula:

$$\sigma = \sqrt{\frac{\sum D^2}{N} - (M - A)^2}$$

Note that the correction is invariably subtracted, whether its sign
before squaring is positive or negative.
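The correction can be verified numerically. The sketch below is an illustration only: the function names are assumptions, and it uses the short distribution 9, 7, 6, 4, 2. It computes the S.D. directly from the true mean and again from an arbitrary whole-number mean, where the average deviation ΣD/N equals M - A.

```python
import math

def sd_direct(measures):
    """Square root of the average squared deviation from the true mean."""
    n = len(measures)
    mean = sum(measures) / n
    return math.sqrt(sum((x - mean) ** 2 for x in measures) / n)

def sd_from_guess(measures, guess):
    """Deviations taken from an arbitrary mean A, then corrected by (M - A)^2."""
    n = len(measures)
    deviations = [x - guess for x in measures]
    mean_minus_guess = sum(deviations) / n           # equals M - A
    mean_square = sum(d * d for d in deviations) / n
    return math.sqrt(mean_square - mean_minus_guess ** 2)

xs = [9, 7, 6, 4, 2]
sd_true_mean = sd_direct(xs)
sd_guess_6 = sd_from_guess(xs, 6)    # A = 6, nearest whole number to M
sd_guess_0 = sd_from_guess(xs, 0)    # A = 0, deviations are the raw measures
print(round(sd_true_mean, 2), round(sd_guess_6, 2), round(sd_guess_0, 2))
```

Because the correction is squared before it is subtracted, the same result follows whether the guess lies above or below the true mean.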


Ex. 15. In the same distribution, let A = 6, i.e. the nearest whole
number to M.

    X      D      D²
    9     +3       9
    7     +1       1
    6      0       0
    4     -2       4
    2     -4      16
                  30

ΣD²/N = 30/5 = 6.0
(M - A)² = (5.6 - 6)² = 0.16
σ² = 6.0 - 0.16 = 5.84
σ = 2.42

The same formula may be extended to several different types of
distributions.

(a) When the distribution is a short one and all the measures are
whole numbers, A may be chosen at zero. The D's will then be the
original X's, so that:

$$\sigma = \sqrt{\frac{\sum X^2}{N} - M^2}$$

Ex. 16.

    X      X²
    9      81
    7      49
    6      36
    4      16
    2       4
          186

ΣX²/N = 186/5 = 37.20
M² = 31.36
σ² = 37.20 - 31.36 = 5.84
σ = 2.42

This method is normally applied when a calculating machine is
available, since the machine is not, like the human mind, liable to
errors when coping with large numbers.

(b) Often the mean is not known, but is calculated at the same
time as the S.D. Now M = ΣD/N + A. Hence an alternative version
of the same formula is:

$$\sigma = \sqrt{\frac{\sum D^2}{N} - \left(\frac{\sum D}{N}\right)^2}$$

(c) When the measures are grouped into classes, the formula
becomes:

$$\sigma = c\sqrt{\frac{\sum (Fd^2)}{N} - \left(\frac{\sum (Fd)}{N}\right)^2}$$

Here c is the size of each class, and d the deviation of each measure
from an arbitrary or guessed mean in terms of number of classes.

Ex. 17. The formula is here applied to the distribution whose
mean was calculated in Ex. 12. It was found there that Σ(Fd)/N =
-0.864. Hence the correction, (Σ(Fd)/N)², = 0.746.

   X       F      d     d²    Fd²
 133+      1     +6     36     36
 128+      1     +5     25     25
 123+      6     +4     16     96
 118+     11     +3      9     99
 113+     22     +2      4     88
 108+     26     +1      1     26
 103+     35      0      0      0
  98+     47     -1      1     47
  93+     43     -2      4    172
  88+     27     -3      9    243
  83+     18     -4     16    288
  78+     10     -5     25    250
  73+      3     -6     36    108
     N = 250                 1478

Σ(Fd²)/N = 1478/250 = 5.912
σ = 5√(5.912 - 0.746) = 11.36

One more instance will be given, to illustrate the calculation of the
mean and S.D. in one operation.

Ex. 18. The following distribution of school marks is grouped
into classes of two. Take A at the 11-12 class, i.e. at 11.5.

    X       F      d     Fd    Fd²
  19-20     2     +4      8     32
  17-18     5     +3     15     45
  15-16     4     +2      8     16
  13-14     4     +1      4      4
  11-12     3      0      0      0
   9-10     7     -1      7      7
   7-8      4     -2      8     16
   5-6      4     -3     12     36
   3-4      2     -4      8     32
   1-2      1     -5      5     25
      N = 36             -5    213

Σ(Fd)/N = -5/36 = -0.139. This is in terms of classes. In terms of
marks, Σ(Fd)/N × c = -0.139 × 2 = -0.278.

Thus M = 11.5 - 0.278 = 11.22.

Σ(Fd²)/N = 213/36 = 5.917; (Σ(Fd)/N)² = 0.019
σ = 2√(5.917 - 0.019) = 4.858

. 

These calculations are quite quickly and easily accomplished with
a little practice, though some care is needed in checking the
arithmetic. For example, the whole of the above calculation, using
4-figure logarithms and checking by slide-rule, occupied the writer less
than four minutes.
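For readers with a computer rather than a slide-rule, the calculation of the mean and S.D. in one operation might be sketched as below. The function name and data layout are assumptions; the frequencies are those of the intelligence test distribution of Ex. 12 and Ex. 17, with A = 105 and c = 5.

```python
import math

def grouped_mean_and_sd(freq_by_step, guess, class_size):
    """Mean and S.D. of a grouped distribution from class steps d."""
    n = sum(freq_by_step.values())
    mean_step = sum(d * f for d, f in freq_by_step.items()) / n         # sum(Fd)/N
    mean_sq_step = sum(d * d * f for d, f in freq_by_step.items()) / n  # sum(Fd^2)/N
    mean = guess + class_size * mean_step
    sd = class_size * math.sqrt(mean_sq_step - mean_step ** 2)
    return mean, sd

# step d -> frequency F (Ex. 12 and Ex. 17)
scores = {6: 1, 5: 1, 4: 6, 3: 11, 2: 22, 1: 26, 0: 35,
          -1: 47, -2: 43, -3: 27, -4: 18, -5: 10, -6: 3}

mean_score, sd_score = grouped_mean_and_sd(scores, guess=105, class_size=5)
print(round(mean_score, 2), round(sd_score, 2))
```

The two sums are accumulated in a single pass over the table, which is the whole economy of the method.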
The S.D. is always greater than Q or the M.V. This is to be
expected, since any big deviations from the mean receive much greater
weight in the S.D. than in the M.V. by being squared before they are
averaged. It generally amounts to one-fifth or one-sixth of the total
range, or sometimes more if the distribution is irregular, as in Ex. 18.
This rule enables us to apply a rough check to our result. If in Ex. 18
we had obtained σ = 2 or σ = 10, we should have known that there
had been some mistake in the computation.

The Significance of the S.D. It was shown earlier in this chapter
that any distribution of scores could be fairly completely defined or
numerically described provided we knew some of the main percentile
levels. Now if a distribution approximates to the normal shape (as it
should do when obtained from a well-constructed test or examination),
it can be more simply specified merely by finding the mean score and
the standard deviation. The reason for this is that the normal curve
is a regular shape which, like the straight line or the parabola, has
its own algebraic equation. For instance: y = kx gives a straight line,
running through the origin, whose slope depends on the size of k. A
normal curve's equation is more complicated, but actually involves
nothing more than x, y, σ and certain constants:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2 / 2\sigma^2}$$

x refers to any given score expressed as a deviation from the mean, and
y is the frequency of that score, expressed as a proportion of the total
number of cases.
This means that, if we know the S.D. of a distribution, we can
deduce y for every value of x, that is, the frequency of every possible
measure. For instance, although we cannot measure the intelligence
quotients of the whole population of Great Britain, yet, if we may
assume that the I.Q. is a variable which is normally distributed, with
a mean of 100 and a S.D. of 15, we can state at once just what
proportions of the population have I.Q.s of 100, 101, ... etc., or what
proportion lies below 67 I.Q., or between 120 and 130 I.Q., and so on.

Furthermore, there are available in most statistical textbooks tables
which will tell us the value of y corresponding to each value of x/σ,
regardless of what the variable may be (I.Q., height, handwriting
speed, etc.). These are known as 'probability integral' tables. The
graph given in Fig. 18 is based on such tables, but is probably easier
to use, and is accurate to two or three significant places. Against x/σ
is plotted, not y (the actual proportion of measures which deviate
from the mean by this amount), but the proportion which deviate by
this amount or more. Consider for example I.Q. 115. Here x is +15,
and if the S.D. is 15, then x/σ = 1. That is, 115 represents a deviation
of 1 × σ above the mean. Now the lowest curve in Fig. 18 tells us that
when x/σ = 1, y = 0.16. In other words, about 0.16 or 16% of the
population would be expected to have I.Q.s of 115 upwards. Similarly
about 16% have I.Q.s of 85 and below, since 85 represents a deviation
of -1 × σ from the mean. If we enquire how many I.Q.s there are of
130 upwards, x/σ = +2, and the second curve gives the answer 0.023,
or about 2.3%. Note that the same curves can be used for making
deductions about any other normally distributed variable, provided
allowance is made for the size of the S.D. The upper curves are for
dealing more accurately with larger values of x/σ and smaller values
of y. Each curve is, as it were, a 10 times magnification of a section
of the curve below.

Ex. 19. In a certain area, with a school population of 25,000, an
intelligence test with a S.D. of 16½ was applied. How many children
may be expected to have I.Q.s of 120 to 130? And what will be the
highest I.Q. among the least intelligent 1,000 of these children?

Intelligence is something which varies continuously; that is to say,
every possible value of I.Q. in between 119 and 120 can occur,
although we usually calculate it only to the nearest whole number.
Strictly speaking, then, we want to know how many children have
I.Q.s between 119.50 and 130.50; and to exclude those whose I.Q.s
are 119.49 or less (who would count as 119's), also those whose
I.Q.s are 130.51 or more (who would count as 131's). Now 119.5
and 130.5 differ from the mean by 19.5 and 30.5, that is by 1.18σ and
1.85σ, taking σ as 16½. Reading off from Fig. 18, the corresponding
proportions are 0.119 and 0.0322. Hence the proportion of the
population within the given limits is 0.119 - 0.0322 = 0.0868 or
8.68%. 8.68% of 25,000 is 2,170, the answer to our first question.

In the second question, 1,000 out of 25,000 is a proportion of
0.040. From Fig. 18, 0.04 corresponds to an x value of 1.75σ.
1.75 × 16½ = 28.875. The lowest thousand therefore have I.Q.s of
100 - 28.875, or less. Hence the answer, to the nearest whole
number of I.Q., is 71.
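Readings from a probability integral graph can nowadays be checked by machine. The sketch below is an assumption-laden illustration rather than the method of the text: it uses `NormalDist` from Python's standard `statistics` module in place of Fig. 18, with a mean of 100 and a S.D. of 16.5 as in Ex. 19.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16.5)          # normal curve assumed for Ex. 19
school_population = 25_000

# Proportion between 119.5 and 130.5, i.e. I.Q.s counted as 120 to 130.
proportion_120_to_130 = iq.cdf(130.5) - iq.cdf(119.5)
children_120_to_130 = school_population * proportion_120_to_130

# I.Q. below which the least intelligent 1,000 children fall.
cutoff_lowest_1000 = iq.inv_cdf(1_000 / school_population)

# Proportion at or above 115 when the S.D. is 15 (about 0.16 in the text).
above_115 = 1 - NormalDist(100, 15).cdf(115)

print(round(children_120_to_130), round(cutoff_lowest_1000), round(above_115, 3))
```

The count of children comes out slightly below the 2,170 obtained from the graph, since the graph readings were taken at deviations rounded to two decimal places; the cut-off I.Q. agrees to the nearest whole number.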


Ex. 20. Does the distribution of intelligence test scores given in
Ex. 17 (p. 47) conform to a normal distribution? In other words, are
the obtained frequencies similar to the frequencies which would be
expected for a truly normal distribution with the same mean (100.68)
and the same S.D. (11.36)?

These test scores, unlike I.Q.s, are discrete. Nobody could score
107½, however carefully his score was computed. If he nearly, but
not quite, completed 108 items, his score would still be 107. Thus
in order to find what proportion may be expected (according to the
normal curve) to fall within the 103-107 class, we must determine
the proportion with scores of 103 or more and subtract from it the
proportion with scores of 108 or more, not the proportions lying
between 102.5 and 107.5. The calculations for all classes are given
in the following table. For example, 103 is a deviation of +2.32
from the mean, or +0.20σ; 108 deviates +7.32 or +0.65σ from
the mean. From Fig. 18 we find the proportions lying at or beyond
these two limits to be 0.421 and 0.258 respectively. Hence the
proportion to be expected within the 103-107 class is 0.421 - 0.258 =
0.163. N is 250, and 0.163 × 250 = 41. The penultimate column
lists the expected frequencies (omitting decimals), and the last
column gives the frequencies actually obtained. The two columns
agree fairly closely, but there are some discrepancies. At present we
have no method for telling whether these discrepancies should be
regarded as appreciable or important, but such a method will be
described later (pp. 90-1). The full answer to this question will
therefore be postponed.




   X     Deviation    x/σ    Proportion    Proportion   Expected  Obtained
         from mean           at or beyond  within       frequency frequency
                             this value    class
 133+     +32.32     +2.85    0.00219       0.00219         1         1
 128+     +27.32     +2.41    0.0078        0.0056          1         1
 123+     +22.32     +1.97    0.025         0.017           4         6
 118+     +17.32     +1.53    0.062         0.037           9        11
 113+     +12.32     +1.09    0.138         0.076          19        22
 108+      +7.32     +0.65    0.258         0.120          30        26
 103+      +2.32     +0.20    0.421         0.163          41        35
  98+      -2.68     -0.24    0.404         0.175          44        47
  93+      -7.68     -0.68    0.246         0.158          40        43
  88+     -12.68     -1.12    0.133         0.113          28        27
  83+     -17.68     -1.56    0.059         0.074          18        18
  78+     -22.68     -2.00    0.023         0.036           9        10
  73+     -27.68     -2.44    0.0073        0.0157          4         3
  68+     -32.68     -2.88    ?             0.0073          2         0
                                            1.00000       250       250


On examining Fig. 18, we see that the proportions whose scores
deviate by 2.5σ or more are very small, less than 1 in 100, and that
those who deviate by 3σ or more only number about 1 in 1,000.
Here, then, we have the reason for a statement made earlier, namely
that the S.D. usually amounts to one-fifth or one-sixth of the total
range of a distribution. For when N is about 100, and the distribution
is a regular one, we shall ordinarily find the measures ranging from
+2.5σ to -2.5σ, i.e. a total of 5σ. And when N approaches 1,000
we may find scores running from +3σ to -3σ, a total of 6σ.
There is yet another way of interpreting normal distributions,
which will be useful to us later, that is in terms of probabilities.




32 THE MEASUREMENT OF ABILITIES


Since 0.159 of persons may be expected to obtain measures 1σ or
more above the mean, the probability that any one person will
measure this much or more is 0.159. The probability that he will
measure less is 0.841. In other words, the odds are 841 to 159, or
about 5¼ to 1 against his measure being as high as or higher than
this. Similarly the probability of a measure as high as +3σ or more
is 0.00135, and the chances against it are 740 to 1. This means, for
example, that we should probably find only one child in an ordinary
elementary school of about 740 pupils with an I.Q. of 145 or more,
if we assume the I.Q.s of school children to be normally distributed
with a σ of 15, since 145 = 100 + 3 × 15. But as many tests have
considerably larger S.D.s, they will yield a greater number of high
I.Q.s, and the probability of coming across such children will be
increased. On the other hand the range of intelligence in any one
primary school is often limited, and this would make the odds
against high I.Q.s much stronger. Thus if σ is 12½, then 150 is 4σ
above the mean, and we can deduce from Fig. 18 that only about 1
in 31,000 (in schools with an average I.Q. 100) would reach this level.
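The odds quoted for measures 1σ, 3σ and 4σ above the mean can be reproduced with a short sketch. It assumes Python's standard `statistics.NormalDist` as a stand-in for the probability integral graph; the function names are illustrative only.

```python
from statistics import NormalDist

unit_normal = NormalDist()     # mean 0, S.D. 1: deviations measured in units of sigma

def probability_at_or_above(z):
    """Proportion of measures lying z sigma or more above the mean."""
    return 1 - unit_normal.cdf(z)

def odds_against(z):
    """Odds against a single measure being as high as z sigma above the mean."""
    p = probability_at_or_above(z)
    return (1 - p) / p

odds_1_sigma = odds_against(1)                     # roughly 5 to 1
odds_3_sigma = odds_against(3)                     # roughly 740 to 1
one_in_4_sigma = 1 / probability_at_or_above(4)    # roughly 1 in 31,000
print(round(odds_1_sigma, 1), round(odds_3_sigma), round(one_in_4_sigma))
```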
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‘ e 
Combining or Comparing Distributions of Marks. In Chap. II we
found that the marks awarded to school work or examinations are
often so erratically distributed that it is impossible to interpret them,
as they stand, correctly. The first stages in the interpretation of
marks involve combining and comparing them. For instance, the
marks for different questions in one examination, or different
examinations in one subject, need to be combined to give a total score.
Often the marks on several subjects are combined to yield a general
view of the pupil's or student's all-round ability. Comparison enters
when we wish to know whether a pupil or student is better on one
subject than on some other subject (subjects which may have been
marked either by the same, or by different, examiners or teachers).
In large-scale examinations where several examiners each deal with a
proportion of the scripts it is, of course, essential that their markings
should be comparable. Sometimes, also, age allowances must be
made; the marks of younger and older pupils need to be rendered
comparable by the elimination of differences due to age.

Now, none of these processes can be carried out directly unless
the marks to be combined or compared occur in closely similar
frequency distributions. This is a fundamental statistical principle,
very elementary in nature, yet very widely neglected. But with the
aid of the statistical tools described in Chap. III, we are now in a
position to determine whether such distributions are sufficiently
similar, and if they are not, to adjust them in such a way as to make
them so. These tools will further assist in interpretation, since they
provide a scientific definition of those exceedingly vague conceptions,
the goodness or badness of a mark or test score, and the ease or
difficulty of an examination question or test item.

Distributions may be dissimilar from one another in three respects:
(1) differences in their averages arise, for example, when one
examiner is far more lenient than another; (2) differences in their
dispersions may arise when one examiner spreads out his marks far more
widely above and below the average than another; (3) differences
occur in their shapes, e.g. normal, skew, rectangular, etc. For the
time being we will consider only the most important shape, the
approximately normal. We may then state that marks can be directly
combined or compared only if drawn from distributions with the
same mean and the same S.D.

The Importance of Equivalent Standard Deviations. Identity of
S.D.s is the more important of these two requirements, though it is
less often recognized. In many instances of combination of marks,
the averages do not matter in the slightest.


Ex. 21. Suppose that the final marks of a group of pupils or
students depend on their results in three subjects, A, B, and C, and
that the distributions are as follows:

  Subject    Mean    Range     S.D.
     A        60     40-80      7
     B        50     30-70      7
     C        80     60-100     7

Now, a pupil who is average on all three subjects obtains a total
of 190/300, and another who is average on two subjects but top on
the third scores 210/300 whatever the subject.

Ex. 22. In contrast to Ex. 21, let us now combine the following
sets of marks, whose S.D.s differ.

  Subject    Mean    Range     S.D.
     A        60     40-80      7
     B        50     10-90     14
     C        80     70-90      3½

A pupil who is average all round again scores 190. But now a
pupil who is average in two subjects and top in a third obtains 210,
230, and 200 according as the third subject is A, B, or C respectively.
The fact of doing well in subject B (which has the lowest average)
actually raises the score the most because B has the biggest S.D.
And doing well in C has the least effect in raising the final total
(although it has the highest average), because it has the smallest S.D.
Similarly if we compare the performances of three pupils who are
at the lower quartile on one subject but average on the other two,
we find their respective marks to be 185, 181, and 188. Doing badly
on B has a big effect, but doing badly on C has little effect.
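The totals quoted for Ex. 22 can be checked with a few lines of Python. The dictionary layout and the function name are assumptions made for the illustration; the means and top marks are those of the Ex. 22 table (means 60, 50, 80; top marks 80, 90, 90).

```python
# Ex. 22: mean and top mark of each subject.
subjects = {
    "A": {"mean": 60, "top": 80},
    "B": {"mean": 50, "top": 90},
    "C": {"mean": 80, "top": 90},
}

def total_when_top_in(best_subject):
    """Total for a pupil who is top in one subject and average in the others."""
    return sum(
        marks["top"] if name == best_subject else marks["mean"]
        for name, marks in subjects.items()
    )

average_all_round = sum(marks["mean"] for marks in subjects.values())
totals = {name: total_when_top_in(name) for name in subjects}
print(average_all_round, totals)
```

The gain over the all-round average is largest for B and smallest for C, in the same order as their S.D.s and not as their means.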
The general principle, illustrated by these examples, is that when
several sets of marks are combined, their relative 'weights,' i.e. their
respective degrees of influence upon the final total, are proportional
not to their averages but to their S.D.s. If in a final mark we wish to
give more weight to one subject than to another, e.g. twice as much
weight to written as to mental arithmetic, we shall accomplish it,
not by making the average mark on the former twice the average on
the latter, but by making the range and S.D. of the former twice as
large. The usual procedure in such weighting is to multiply the
former set of marks by two. This may produce the desired effect, not
because it doubles the average, but because it doubles the S.D. The
same effect would follow if the average remained unchanged while
the deviation of each pupil's mark from this average was doubled.

When the marks derived from several questions in a single
examination are to be combined in certain proportions, this same principle
of adjusting the S.D.s for the questions must be borne in mind.

In order to combine or compare two sets of marks whose
distributions are dissimilar, we should then express each set in the form of
deviations from their own means, and multiply one set by a figure
which will raise or lower its S.D. to the same level as that of the other
set. For instance, in the two sets B and C in Ex. 22, the ratio of S.D.s
is 14 to 3½, or 4 to 1. The deviations in C will therefore become
equivalent to those in B if multiplied by 4. Alternatively, each mark
may be expressed in x/σ form. The top marks on A, B, and C thus
become (80 - 60)/7, (90 - 50)/14, and (90 - 80)/3½. All of these
equal +2.86. When so converted, they are all equivalent.

Scores which are expressed in this form are referred to as standard
scores, or σ scores, or sometimes z-scores. As we shall see below,
they are frequently adopted in scoring mental tests. For example, the
vocational tester may wish to know whether a man's performance
on a mechanical test is relatively higher or lower than on a clerical or
some other test. The tests are likely to have different means and S.D.s,
but the man's standard scores can legitimately be compared.
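A standard score is simply the deviation of a mark from its own mean, divided by its own S.D. The sketch below is illustrative only (the function name is an assumption); it converts the top marks in three subjects whose means are 60, 50 and 80 and whose S.D.s are 7, 14 and 3½.

```python
def standard_score(mark, mean, sd):
    """Deviation from the mean expressed in units of the S.D. (x / sigma)."""
    return (mark - mean) / sd

# (top mark, mean, S.D.) for subjects A, B and C
top_marks = {"A": (80, 60, 7), "B": (90, 50, 14), "C": (90, 80, 3.5)}

standard = {name: standard_score(*values) for name, values in top_marks.items()}
print({name: round(z, 2) for name, z in standard.items()})
```

Marks of 80, 90 and 90 look unequal as they stand, yet all three stand at the same distance above their own means once the S.D.s are allowed for.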

The Importance of Equivalent Means. Naturally the averages of
distributions are important under some circumstances. If marks on
several questions or several subjects are to be combined, and
alternatives are allowed, so that each pupil's final mark is derived from a
selected number of questions or subjects, then it is quite essential to
ensure equivalent means. Thus in Exs. 21 and 22, if pupils took either
A and B, or A and C, those who were at the average level would total
110 on the former combination, 140 on the latter. Again, when scores
are to be, not combined, but compared, it is obvious that differences
in means will entirely upset the comparisons. The adjustment of
discrepancies in means is easy: merely add or subtract so much from
all the marks in one subject or question until its average is the same
as that for the other subject or question. More often than not,
however, discrepancies in means are accompanied by discrepancies in
S.D.s. In that case resort must be made to the technique already
described, according to which each mark is expressed as x/σ, or to some
modification of this technique (cf. pp. 59-62).
EQUATING OF MARKS BY THE PERCENTILE TECHNIQUE

Although the statistical treatment of distributions in terms of
means and S.D.s is far from difficult, it may admittedly lie outside
the scope of the ordinary teacher or examiner. Further, it cannot be
used when the distributions are clearly non-normal. For instance, if a
headmaster to whom were submitted the marks listed in Ex. 5 (p. 29)
wished to compare the pupils' standings on reading and geography,
he could not use this technique. The alternative, percentile, system
would, however, be quite suitable for making such a comparison,
and it is probably simpler for the uninitiated to comprehend. (It does
not even involve looking up squares and square roots, as does a
technique based on the S.D.)

Percentiles will provide a uniform scale in terms of which
comparisons may be made between any sets of marks. Whatever the test or
examination which a group of pupils or students takes, a certain
percentile level always means the same level of achievement (relative
to that group of people). For instance, on looking at Ex. 5 we might
be inclined to suppose that 8 marks for reading represents a better
achievement than 3 marks for geography; certainly the pupils
themselves would infer that this was so. But actually they are precisely
equivalent, since both marks fall at the 25th percentile, Q1. The
teacher may, of course, assert that the general standard of reading in
that class is good and the geography poor. This is an objection which
will frequently be raised by critics of this chapter, and we will show
later why we believe it to be baseless.

Percentile Graphs. One of the best ways of comparing two sets of
marks is by a 7-point percentile graph, which is illustrated in the
following example.
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X Frequencies E 
Mark English Arithmetic с 
90 + аар 5 Е 
80. : ОНИ 
ЛОЯ 155 © ©. 13< 6 
4:2 60 ЕБ не 15 2 
50--” dh el XAR 10 POR 
40 + Voie USS 2 
30 + MEE ESSE 0 е" 
о * 0e 1 oe 
—6 = 
Е 57 


Ех. 23.—Тһе table shows the marks of 57 pupils, aged 12 +, in 
a Scottish school on final examinations in English and arithmetic.
Though the tabulation is coarse, it indicates that the two
distributions differ in their means and S.D.s, and that, while the first is
roughly normal, the second is negatively skewed. The procedure is,
first, to determine the 99th, 90th, 75th, 50th, 25th, 10th, and 1st
percentiles for each set of marks. From the original lists these are
found to be 90, 84, 77, 67, 59, 49, 39, and 97, 89, 81, 71, 60, 53, 35,
respectively. Draw a graph, as shown in Fig. 19, where English
marks are plotted along the Y axis against arithmetic marks on the
X axis. Plot these seven points (X = 97, Y = 90; X = 89, Y = 84;
etc.), and connect them up by a straight line or smooth curve which
runs as nearly as possible through them all. This line may be
extended at both ends so as to cover the total ranges of marks. From
it we can now read off the mark on the English scale which
corresponds to each mark on the arithmetic scale, and vice versa. Thus
41 in English is equivalent to 39 in arithmetic, 80 in English to 85
in arithmetic.

Fig. 19. Comparison of two sets of school marks by means of a
seven-point graph.

Often it is not necessary to determine as many points as seven in
order to fix the graph. Three points, preferably the 90th, 50th, and
10th percentiles, may give sufficient accuracy. Note that just the
same procedure could have been used even if the distributions had
differed widely in range and dispersion, e.g. if one subject had been
marked out of 40, the other out of 200.¹
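Reading equivalents off a percentile graph can be imitated by interpolation. The sketch below is an approximation under stated assumptions: it joins the seven percentile points of Ex. 23 by straight segments (the text allows a smooth curve, so readings may differ by about a mark), and the function name `equate` is hypothetical.

```python
# 1st, 10th, 25th, 50th, 75th, 90th and 99th percentiles from Ex. 23
english = [39, 49, 59, 67, 77, 84, 90]
arithmetic = [35, 53, 60, 71, 81, 89, 97]

def equate(mark, from_points, to_points):
    """Convert a mark from one scale to the other by straight-line interpolation.

    Marks beyond the outermost percentile points are handled by extending
    the first or last segment, as the graph line is extended at both ends.
    """
    pairs = list(zip(from_points, to_points))
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if mark <= x1:
            break                       # this segment contains (or precedes) the mark
    return y0 + (mark - x0) * (y1 - y0) / (x1 - x0)

english_41_as_arithmetic = equate(41, english, arithmetic)
english_80_as_arithmetic = equate(80, english, arithmetic)
print(round(english_41_as_arithmetic, 1), round(english_80_as_arithmetic, 1))
```

The two conversions come out close to the equivalents of 39 and 85 read from the graph; the small difference at the upper mark reflects the straight segments used here in place of a smooth curve.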
Interpretation of Marks by Means of Percentile Levels.—The results of many standardized tests of intelligence and vocational aptitudes are quoted in terms of percentiles, since the original or raw scores have little meaning in themselves; for example, percentiles relative to 13-year modern pupils, or percentiles relative to R.A.F. recruits, and so on. It has sometimes been suggested that the same system should be used for school or university examinations. The original marks would not be published at all. Average achievement would always be represented by 50, and very high or low achievement by marks approaching 100 and 0. Though this would be fairer in some ways, it would be difficult for pupils, students, or parents to grasp that these were always relative to the group which had taken that examination. Percentiles are, in fact, precisely equivalent to rank orders, except that they always run from 100 to 0 instead of from 1 to N. The scheme would also not be feasible for examinations taken by small numbers, say less than twenty students (cf. p. 66). Nevertheless it would be a considerable step forward if the mark lists for large-scale examinations, at universities, training centres or secondary schools, included a table of the median, quartiles and some of the other percentiles for the marks which had actually been awarded. Teachers and students would soon learn the significance of these terms, and would then be able to see what standards of marking the various examiners were adopting. They could also obtain a much better idea of relative marks in different subjects than from the present supposedly absolute marks, which actually vary widely from one examination to another.
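A table of the kind proposed here (median, quartiles, and a few other percentiles of the marks actually awarded) is easy to produce mechanically. The following is a minimal sketch in Python, not part of the original text; the function name `percentile_table`, the chosen percentile levels, and the list of marks are all hypothetical, and the "inclusive" interpolation rule is only one of several conventions for locating percentiles.

```python
from statistics import quantiles

def percentile_table(marks, levels=(90, 75, 50, 25, 10)):
    """Return the mark found at each requested percentile level."""
    # quantiles(..., n=100) yields the 99 cut points for percentiles 1 to 99.
    cuts = quantiles(marks, n=100, method="inclusive")
    return {level: cuts[level - 1] for level in levels}

# Hypothetical marks awarded to 26 candidates (30, 32, ..., 80).
marks = list(range(30, 81, 2))
table = percentile_table(marks)
for level, mark in table.items():
    print(f"{level:>2}th percentile: {mark:.1f}")
```

Printed alongside a mark list, such a table lets a reader see at once where any individual mark stands relative to the group that sat the examination.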


¹ Further discussion of this topic may be found in McIntosh et al. (1949).

Defects of Percentiles.—Although percentiles are invaluable for purposes of comparing distributions, they may be very misleading if used for combining distributions. The reason is that percentile units are very far from equivalent to one another, since they are far closer together at or near the median than they are at or near the extremes. It can be seen from the probability integral graph (Fig. 18) that a percentile level of 90, i.e. the measure above which only 10% of the measures lie, corresponds to an x/σ or standard score of +1.28. The 80th similarly corresponds to +0.84; the 60th and 50th to +0.25 and 0.00. Thus the difference between the first two of these, 0.44, is decidedly greater than that between the second pair, 0.25. Actually the 99th, 94th, 78th, 50th, 22nd, 6th, and 1st percentiles are all approximately equally spaced, in the sense that a child who is at the 99th percentile is as superior to one at the 94th, as the 94th is to the 78th, or the 78th to the 50th. The consequence of this inequality of units is that, although we may, if we wish, average or otherwise combine a person's percentiles on different tests, we shall obtain thereby quite different results from those obtained by averaging his scores.
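The inequality of percentile units can be checked without the graph. The sketch below is an illustration only, assuming an exactly normal distribution and using Python's standard `statistics.NormalDist`; the helper name `standard_score` is hypothetical. It converts percentile levels to standard scores and measures the gaps between the seven levels said to be equally spaced.

```python
from statistics import NormalDist

def standard_score(percentile):
    """Standard (x/sigma) score below which `percentile` % of a normal group falls."""
    return NormalDist().inv_cdf(percentile / 100)

# Levels quoted in the text as roughly equally spaced.
levels = [99, 94, 78, 50, 22, 6, 1]
scores = [standard_score(p) for p in levels]
gaps = [upper - lower for upper, lower in zip(scores, scores[1:])]

for p, z in zip(levels, scores):
    print(f"{p:>2}th percentile: {z:+.2f}")
print("gaps between neighbours:", [round(g, 2) for g in gaps])
```

The gaps come out nearly equal in standard-score units even though the percentile differences themselves run 5, 16, 28, 28, 16, 5.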

Ex. 24.—Pupil A's marks on two examinations lie at the 98th and 54th percentiles; Pupil B's marks on both examinations lie at the 85th percentile. A's average percentile of 76 is distinctly poorer than B's 85. But if these figures are converted into x/σ scores, i.e. into units which really are equivalent, it will be found that A is actually slightly better than B. For the x/σ values of the 98th and 54th percentiles are +2.054 and +0.100, average +1.077; whereas the x/σ value of the 85th percentile is only +1.036.

It is advisable, therefore, never to use percentiles for combining marks. The same is true of rank orders (cf. pp. 25-6).
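The comparison of Pupils A and B can be reproduced in a few lines. This is a sketch under the assumption of a normal distribution; the variable names are illustrative and not taken from the text.

```python
from statistics import NormalDist, mean

def standard_score(percentile):
    """Standard (x/sigma) score for a percentile level in a normal distribution."""
    return NormalDist().inv_cdf(percentile / 100)

pupil_a = [98, 54]   # percentile levels on the two examinations
pupil_b = [85, 85]

a_percentile = mean(pupil_a)                       # averaging percentiles
b_percentile = mean(pupil_b)
a_score = mean(standard_score(p) for p in pupil_a)  # averaging standard scores
b_score = mean(standard_score(p) for p in pupil_b)

print(f"average percentile:     A = {a_percentile}, B = {b_percentile}")
print(f"average standard score: A = {a_score:+.3f}, B = {b_score:+.3f}")
```

The two ways of averaging place the pupils in opposite orders, which is the whole objection to combining percentiles.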
Adoption of a Standard Scale for All School or Examination Marks.—The following solution to this difficulty may be recommended. Each chief examiner, headmaster, or other authority who is concerned with combining, comparing, and interpreting sets of marks awarded by sub-examiners, class teachers, etc., should adopt an arbitrary standard scale. He might, for instance, choose a percentage scale ranging roughly from 30% to 90%, with a mean of 60% and a S.D. of 10. Now, on such a scale the 7 percentile points mentioned above fall at 83, 73, 67, 60, 53, 47, and 37 respectively. Each set of marks submitted by a teacher or examiner should be converted into this scale by the graphical method already described. If this is done, all the marks from all the markers will be strictly comparable with one another, and they will be expressed in units which are (sufficiently closely for all practical purposes) equivalent to one another, so that they can be combined or averaged at will. And there is the further advantage that the scale corresponds well to the present conventional conception of high and low percentages. Naturally any other standard scale can be chosen if preferred, and its percentile points determined from the probability integral graph.¹ Fig. 20 shows the graphs for transforming the two distributions of Ex. 23 into the suggested scale. 97 for arithmetic and 90 for English are plotted against 83 (the 99th percentile point) on the standard scale, and so on. The standard mark for any other mark can be read off. Thus 81% in arithmetic becomes 67% standard, and 81% in English becomes 71% standard.

Fig. 20.—Conversion of two sets of school marks into a standard scale.

¹ The x/σ values of the 99th and 1st percentiles are + and − 2.326 respectively, of the 90th and 10th they are + and − 1.282, of the 75th and 25th they are + and − 0.675.
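The graphical conversion amounts to joining the seven percentile points by straight lines and reading off intermediate marks. A rough computational equivalent is sketched below; it is an assumption-laden illustration rather than the author's procedure. The names `standard_points` and `make_converter` are hypothetical, the raw marks are invented (a subject marked out of 200), and straight-line interpolation stands in for the hand-drawn graph.

```python
from statistics import NormalDist, quantiles

LEVELS = (99, 90, 75, 50, 25, 10, 1)  # the seven percentile points

def standard_points(mean=60, sd=10):
    """Marks on the standard scale at the seven percentile points."""
    return [mean + sd * NormalDist().inv_cdf(p / 100) for p in LEVELS]

def make_converter(raw_marks, mean=60, sd=10):
    """Build a raw-mark -> standard-mark function for one set of marks."""
    cuts = quantiles(raw_marks, n=100, method="inclusive")
    raw_pts = [cuts[p - 1] for p in LEVELS]        # highest point first
    std_pts = standard_points(mean, sd)

    def convert(mark):
        last = len(LEVELS) - 2
        for i in range(len(LEVELS) - 1):
            hi_raw, lo_raw = raw_pts[i], raw_pts[i + 1]
            # Use the first segment reaching down to the mark; the end
            # segments are extended to cover marks beyond the outer points.
            if mark >= lo_raw or i == last:
                if hi_raw == lo_raw:
                    return std_pts[i + 1]
                slope = (std_pts[i] - std_pts[i + 1]) / (hi_raw - lo_raw)
                return std_pts[i + 1] + (mark - lo_raw) * slope

    return convert

# Hypothetical subject marked out of 200, one candidate on every mark.
to_standard = make_converter(list(range(0, 201)))
print([round(s) for s in standard_points()])
print(round(to_standard(150), 1))
```

Each examiner's marks get their own converter, after which all converted marks share the same mean and dispersion and may be averaged freely.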




Another way of achieving the same result is as follows. By applying the method described in Ex. 20 (p. 50), a table may be drawn up showing the numbers of cases to be expected within each class-interval on the desired standard scale. Thus for a scale with a mean of 60 and S.D. 10, the following percentages of pupils should fall within successive class-intervals of 5 marks.


Ex. 25.—

   Mark       Percentage
   88-92      0 or ½
   83-87      1
   78-82      3
   73-77      6½
   68-72      12
   63-67      17
   58-62      20
   53-57      17
   48-52      12
   43-47      6½
   38-42      3
   33-37      1
   28-32      0 or ½
              ------
              100
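The percentages in such a table follow from the normal curve. The sketch below shows one way they might be computed for any chosen mean and S.D.; it is not the method of Ex. 20 itself, and the function name and its parameters are hypothetical. Whole-number marks are treated as occupying a band half a mark wide on either side, which is an assumption.

```python
from statistics import NormalDist

def expected_percentages(mean=60, sd=10, lowest=28, width=5, n_intervals=13):
    """Percentage of a normal group expected in each class-interval of marks."""
    dist = NormalDist(mean, sd)
    table = {}
    for k in range(n_intervals):
        lo = lowest + k * width
        hi = lo + width - 1
        # Whole-number marks lo..hi cover the continuous band lo-0.5 to hi+0.5.
        share = dist.cdf(hi + 0.5) - dist.cdf(lo - 0.5)
        table[(lo, hi)] = 100 * share
    return table

table = expected_percentages()
for (lo, hi), pct in sorted(table.items(), reverse=True):
    print(f"{lo}-{hi}: {pct:5.1f}")
```

Rounded to convenient figures, the computed values agree with the percentages given for the central intervals.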


Teachers or sub-examiners may be given such a table and instructed to award marks to their pupils roughly in accordance with the frequencies there shown (precise conformity is not necessary). Or, alternatively, the chief examiner or head teacher may take each of the distributions of unscaled marks, award a scaled mark between 88 and 92 to the top 0 or ½% of the examinees, marks between 83 and 87 to the next 1% of them, and so on. Only very exceptionally good or poor candidates, those better or worse than 999 in 1,000 of the ordinary run, should be given marks of over 90% or under 30%. For in a normal distribution only a proportion of 0.135% is expected to score more than +3σ, or less than −3σ.

Ex. 26.—

              Frequency
   Mark       (a)    (b)
   17-20      Exceptional cases only
   14-16      4      1
   13         6      3
   12         12     5
   11         18     7
   10         20     8
   9          18     7
   8          12     5
   7          6      3
   4-6        4      1
   0-3        Exceptional cases only
              ---    ---
              100    40


Burt (1917) recommends an alternative standard scale of marks, based on a similar plan, but with an average of 10 and a maximum range of 0-20. The scale is given here with the frequencies to be expected when (a) 100 and (b) 40 pupils are marked according to it. Here also exact adherence should not be demanded. Any one set of marks for 40 pupils may depart from it fairly widely. But in the long run all the marks awarded by any one teacher or examiner should show the same mean, S.D., and tendency to normality as does this standard scale.


OBJECTIONS TO RATIONALIZED MARKING 

There is no doubt that many teachers and examiners will object strongly to the principle upon which this chapter is based, that the average and the dispersion of sets of marks should always be the same in any one institution. Faculties in a university which consistently award high marks sometimes try to justify themselves by claiming that they get a ‘better run’ of students than do those which consistently award low marks. Examiners regard it as only natural to award different average marks to the answers to different questions in one paper, since some questions are ‘done better’ than others. Teachers similarly regard their classes as ‘better at’ some subjects than at others, and so deserving of higher marks. Again, the class they have this year may seem to them an exceptionally ‘good’ or ‘bad’ one, and the compositions written by the class may be ‘better or poorer than usual.’ Finally, those mathematical lecturers and teachers who are particularly prone to award very widely dispersed marks state that their students exhibit a ‘wider range of accomplishment’ than do students of other subjects.
Justification for Equalizing the Average in All Sets of Marks.—Many of these claims are extremely dubious. Thus it is more likely that the university departments which award high marks will attract a poorer, rather than a better, run of students, and that the subjects which are marked more strictly will be pursued chiefly by the better students who care more for the subjects than for marks. Again, one is justified in suspecting teachers' judgments about the merits of their classes, since some teachers appear, year after year, to be blessed with good classes, and others to be cursed with bad ones. According to what is popularly known as ‘the law of averages,’ one would expect the number of good pupils to be balanced in the long run by an equal number of bad ones. As a matter of fact, the statistical techniques outlined in Chap. V give us a method of predicting just how much classes are likely to vary from year to year, or English compositions from week to week, owing to so-called fluctuations of sampling. Here is one instance. A class of 45 pupils averages 60% in their general scholastic work on a scale of marks running roughly from 30% to 90%. From these data it may be stated that there is an even chance that the average level of next year's class will lie between 59% and 61%, and that it is extremely improbable that next year's will differ from this year's by more than 3% or 4% up or down. Consider also the marking of English compositions, or of half a dozen or more separate questions in an examination paper. If the scale of marks employed ranges roughly from 4 to 20 (S.D. 3), then we can predict that the maximum variation likely to occur in the average marks for different compositions or questions is 3 marks, provided that there are at least 36 scripts, and only 2 marks if there are 81 scripts. An individual pupil may certainly range very widely in his level of performance on different questions, but the average level of a group of pupils is far more stable than is commonly supposed. The apparent variations in the goodness of answers are due much less to variations in the pupils than to irregularities in the suitability of the questions or of the topics for compositions. But if an examiner sets a bad question which is beyond the scope of the majority of examinees, or which does not allow them to express their fullest capacities, there is no justification for giving low marks. It is therefore much fairer to give the same average mark to the actual average script in all cases, as was suggested in Chap. II.
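The figures quoted for fluctuations of sampling can be reproduced from the standard error of a mean. The sketch below gives one reading that matches them; it is an interpretation, not a formula stated in this passage. It assumes a S.D. of 10 for the 30%-90% scale (as on the standard scale described earlier), takes the "even chance" band as the probable error (about 0.6745 standard errors), and takes the "maximum variation likely to occur" as a total spread of three standard errors either side of the mean. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def standard_error(sd, n):
    """Standard error of the mean of n marks drawn from a scale with the given S.D."""
    return sd / sqrt(n)

def even_chance_band(mean, sd, n):
    """Band within which the next group's average has an even chance of falling."""
    probable_error = NormalDist().inv_cdf(0.75) * standard_error(sd, n)
    return mean - probable_error, mean + probable_error

def likely_spread(sd, n, k=3):
    """Total width of a band of k standard errors either side of the mean."""
    return 2 * k * standard_error(sd, n)

low, high = even_chance_band(60, 10, 45)
print(f"class of 45: even chance between {low:.1f} and {high:.1f}")
print(f"36 scripts, S.D. 3: spread of {likely_spread(3, 36):.1f} marks")
print(f"81 scripts, S.D. 3: spread of {likely_spread(3, 81):.1f} marks")
```

Because the standard error shrinks with the square root of the group size, quadrupling the number of scripts only halves the spread.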

It is a different matter when a group of pupils or students is known, from objective evidence, to be selected in such a way as to be, on the average, superior or inferior to the mean of some larger group. A special class for scholarship candidates and one for backward pupils will naturally show generally good and poor attainments respectively. But in many cases the teacher or examiner who claims to have a specially good or poor set of pupils or scripts is judging purely on the basis of subjective recollections of what he regards as average level. When a test or examination is applied to all the pupils of a certain age in a school, or in several schools, then the whole group should be marked according to the principles already outlined, and the marks for separate classes within this wider group may be picked out and averaged. Such averages will then, quite justifiably, show a certain range of variation. But when a test or examination is given only to a single class, it is far safer not to trust the marker's subjective view of the class's general level, but to award an average mark to the average script actually obtained from this class. Similarly the averages for different school subjects should be the same, regardless of whether a class appears to be better at one subject than at another, except when objective evidence (derived from the application of the same tests or examinations to other, larger, groups of pupils) proves that the class's attainments are uneven.

The ‘Goodness’ or ‘Badness’ of a Mark.—We still have to face the wider question: what do teachers and examiners mean when they say that a pupil's work is good or bad, or an examination paper easy or difficult? For endless objections of the type discussed above will be raised until this is settled. Most persons who have marking to do obviously believe that they possess in their minds some absolute standard of goodness and badness by means of which they can evaluate a pupil's or a class's achievement. But numerous experiments have proved that such subjective judgments of merit or difficulty, when obtained from several different persons, are often widely discrepant. It was for this reason that we were forced to conclude, in Chap. II, that all marking consists primarily in ranking a set of performances in order, and that it is a matter of relative comparison, not of absolute placement. The goodness of a child's performance at some task can therefore mean only that he ranks high when his performance is directly compared with the performances of other children at the same task, under similar circumstances. An achievement is good if only very few people can do as well or better, and it is poor if only very few people do it as badly or worse. Similarly, the rational definition of the difficulty of a task is the fewness of people who can accomplish this task.

‘Fewness of people,’ in statistical terminology, connotes either percentile level or, preferably, x/σ value. A mark is not necessarily a good one because it happens to be 80% or 9/10 and the like. But it is good relative to the marks of some particular group of persons if it falls at the 95th percentile, or is 2σ above the mean mark obtained by these persons, or if it stands high on a standard scale whose mean and S.D. are known. Such designations of goodness or badness are entirely unequivocal.

The Ease or Difficulty of Questions.—By approaching marking from this standpoint we obtain a rational scale, not only of the goodness or badness of pupils' work, but also of the ease or difficulty of questions or test items. If, for example, five items are answered correctly by 16%, 22½%, 31%, 40%, and 50% of a large group of persons, then we can state that (for these persons) the items are all equally spaced in difficulty. For these proportions correspond to x/σ values of +1.0, +0.75, +0.50, +0.25, and 0.0 respectively. Similarly some questions which are below average difficulty will be equally spaced if answered by, say, 91%, 94%, and 96% of pupils, for here the proportions of incorrect answers, namely 9%, 6%, and 4%, correspond to x/σ values of −1.35, −1.55, and −1.75 respectively.
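The difficulty values quoted for these items can be obtained directly from the proportion answering correctly. The following is a brief sketch, assuming the ability measured by each item is normally distributed in the group; the function name `item_difficulty` is hypothetical.

```python
from statistics import NormalDist

def item_difficulty(percent_correct):
    """Difficulty of an item as the x/sigma level that its passers exceed.

    Positive values mark items harder than average, negative values easier ones.
    """
    return NormalDist().inv_cdf(1 - percent_correct / 100)

for pct in (16, 22.5, 31, 40, 50, 91, 94, 96):
    print(f"{pct:>5}% correct: difficulty {item_difficulty(pct):+.2f}")
```

Equal steps on this scale correspond to very unequal steps in percentage correct, which is why raw pass rates are a poor measure of how much harder one item is than another.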

Justification for Equalizing the Dispersion in All Sets of Marks.—The claim that mathematical accomplishments vary more widely than do accomplishments in other subjects requires some further consideration. It is true that it is usually easier to discriminate between different levels of mathematical than of, say, English composition ability, because tests in the former can generally be marked more objectively and accurately. Thus in the work of an ordinary school class it may be impossible to distinguish more than about ten different levels of essay writing, whereas perhaps a hundred levels of mathematical skill may be distinguished. Although there is some justification for attaching greater weight to marks obtained by the more efficient measuring instrument, yet this greater efficiency is no criterion of the range of ability, as the following analogy shows. In a school quarter-mile race, one referee may time the boys with a seconds' hand watch, and find that some take 50, others 51, 52, ... 60 seconds. Another referee may possess a tenth-of-a-second stop-watch and find that some boys take 50.0, others 50.1, 50.2 ... 59.9, 60.0 seconds. But the range of ability is the same in both cases.

It is only fair to mention that some authorities such as G. H. Thomson (1939) have suggested that greater variability would be found in mathematical than in other abilities, if it were possible to measure variability absolutely. But there seems to be no satisfactory or convincing way of achieving such absolute measurement in psychology. Perhaps it is wisest, therefore, to follow the convention that mental testers have adopted of assuming the same dispersion for all abilities in representative groups of all ages.

The Marking of Small Groups.—Peculiar difficulties still remain in applying the principles and techniques of this chapter to the marking of small groups. The mean and the S.D. of marks among, say, half a dozen Honours students may vary far more from year to year than they do among 40 or 100 students. If the average is taken as 60% this year, the odds are even that it will lie somewhere between 57 and 63 next year, but it may drop as low as about 50% or rise as high as 70%. And if there is only one candidate, who is awarded 60% this year, a single candidate next year may deserve anywhere from 30% to 90%. Until such time as objective tests are available, standardized on Honours students from many universities, or over several years, the examiner can only employ subjective marking scales, as indicated in Chap. II (p. 35).


TEST NORMS

Test norms are tables of the scores of large groups of children or adults by reference to which it is possible to interpret the significance of the score obtained by any person, or group of persons, to whom the test is applied. The mental tester fully admits that a test performance, as such, tells us nothing as to its goodness or badness, and that it must be compared with the performances of other similar persons tested under the same conditions. Thus some tests are standardized, and norms are obtained, by being applied to large groups of typical 11-year junior school children; others are given to school-leavers aged 14 to 15, or to grammar school pupils; others to adult Army recruits, or to University students, and so forth. American testers often publish grade norms, i.e. the standards obtained by successive school classes in their elementary and high schools. Here too a number of schools representative of all levels must be chosen, on account of the considerable variations in standards between different schools.

A mere statement of the average or median scores of such groups is, of course, insufficient. Often the deciles are quoted, so that it is possible to state that a person's performance is, say, as good as that of the best 20% of the group upon which the test was standardized. A system based on x/σ values, or standard scores, is preferable since, as shown above, percentile units vary widely in magnitude at different parts of the scale. A system which is becoming widely adopted is first to determine the percentiles and then to convert these into standard scores with an arbitrary S.D. (usually 10, 15 or 20) and an arbitrary mean (usually 50 or 100). By this device the converted scores may be regarded as an equal-unit scale, even though the original distribution of raw scores was markedly non-normal. Thus with a S.D. of 15 and a mean of 100, whatever raw score falls at the 84th percentile is converted to 115. The range of standard scores on this scale would be from about 145 to 55.
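The conversion from percentile to standard score described here can be sketched in a few lines. This is an illustration only: the function name `scaled_score` and its default mean and S.D. are assumptions chosen to match the example in the text.

```python
from statistics import NormalDist

def scaled_score(percentile, mean=100, sd=15):
    """Standard score, on a scale with the given mean and S.D., for a percentile level."""
    return mean + sd * NormalDist().inv_cdf(percentile / 100)

for p in (99, 84, 50, 16, 1):
    print(f"{p:>2}th percentile -> {scaled_score(p):.0f}")

# The same percentile on a scale with mean 50 and S.D. 10.
print(f"84th percentile -> {scaled_score(84, mean=50, sd=10):.0f}")
```

Because the conversion starts from percentile ranks and not from the raw scores, the shape of the raw-score distribution does not affect the result.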

It should by now be obvious that the usefulness of such norms is strictly limited by the nature of the standardization group. Norms are purely and simply a summary of the test performances of a particular group of testees, and must be interpreted with reference to that group. For example, if an American educational test is given to a British child and his score falls at the 70th percentile according to the norms published with the test, this means that he did better than 70% of the American children who took the test, but does not necessarily show that he is above the British median or average. In point of fact American educational tests are almost useless in this country because their standards differ so greatly from ours (cf. McGregor, 1934), and, of course, because their content and wording are often unsuitable. Probably the same is true of group verbal intelligence tests, though non-verbal tests and individual scales such as Terman-Merrill and Wechsler-Bellevue (with minor adaptations) seem to give comparable results (cf. Chap. IX). Even between England and Scotland the differences are so big that few tests which have been standardized in one country can be used in the other, unless fresh norms have been collected (cf. Vernon, O'Gorman and McLellan, 1955). The appalling efficiency of Scottish teaching methods raises the performance of Scottish elementary school children on tests of the 3 R's by anything from 10% to 50% over the English level. Such evidence as we possess, however, points to no appreciable differences on tests of general intelligence.

Many tests have been adequately standardized in London or Glasgow, but are not appropriate for use in the provinces or in rural areas where the educational level, and possibly also the intelligence level, may be different. Probably it would be better to provide more than one set of norms for such different types of school, so that a child's test performance could be referred to the most suitable set. Apparently even the size of school has some effect upon educational level (cf. Mowat, 1938). Still more potent is socio-economic status, since this is known to have a marked effect upon educational and intellectual levels.¹ Thus if a test has been standardized by being applied only to children in a rather poor area, its norms will not be appropriate for schools in a relatively superior area. Adequate norms for, say, children aged 11+ would be obtained only if all types of schools containing pupils of this age were adequately represented in the standardization group. Norms derived only from one type of school may still be of some use, but the tester must then remember to interpret his testees' results with reference to this type.

The origin and the adequacy of many of the published test norms are so dubious that we would advise testers in schools to treat them all with caution, and when possible to do without them. Once a whole school class, or all the children of a certain age in the school, have been tested, each child's standing relative to the rest of this group will be known, and will provide most of the information needed for educational diagnosis and prognosis, without reference to any outside standards. And if the same tests can be applied in a school for several years, an extremely useful set of ‘local’ norms could quite easily be collected, either in terms of percentiles or of x/σ scores.

¹ Cf. Burt (1937). The reasons for this effect are probably partly environmental and partly hereditary. Poor surroundings and upbringing undoubtedly have many adverse influences upon education. But in addition there is indisputable evidence of a connection, albeit a small one, between intelligence and social class, quite apart from environmental influences.

Age norms may be defined as follows. A child who obtains a Mental Age or M.A. (on an intelligence test), or an Educational Age or E.A. (on an educational test) of x years is one whose performance is as good as that of the average child of Chronological Age or C.A. of x years. For example, a child aged 10.0 may do as well in an intelligence test as an average 11.0-year-old, but only score up to the level of a 9.0-year-old on tests of reading and arithmetic. He is then said to possess C.A. 10.0, M.A. 11.0, Reading and Arithmetic Ages of 9.0. At first sight this system of expressing test results is far simpler than the percentile or x/σ systems. It provides a uniform scale of units by means of which a child's relative performances on any number of tests can be compared. His strong or weak points in different school subjects, or even in particular aspects of school subjects (e.g. the main arithmetical skills), and his superiority or inferiority on verbal and non-verbal intelligence tests, etc., can be seen at a glance from a table or graph of his various E.A.s and M.A.s. Nevertheless, there are many serious weaknesses in this type of norm, which we shall have to describe in detail.

Difficulties in Establishing Age Norms.—First there are almost insuperable difficulties in collecting standardization groups which will be truly representative of all the children of a certain age. Between 6 and 11 years the primary schools will provide groups whose average scores are probably close to the averages in the total population. Any one age group will, of course, be scattered through a large number of different school classes. But the range of ability in primary schools is restricted, since some of the brightest and dullest sections of the population attend private and special schools respectively. And, as we have seen, the range or dispersion of a distribution is quite as important as its average. Beyond 11 to 12 years, when primary school pupils proceed to various types of schools, it becomes increasingly difficult to secure samples whose averages, let alone ranges, are likely to be typical. In a very few instances results have been obtained from practically all the children in a given district,¹ or from all the children born on certain dates.² But often the best that can be done is to control the socio-economic factor. Suppose that a district contains a hundred schools, and that a tester wishes to standardize a test on pupils in ten of them; he should see that the schools he selects cover the whole range of socio-economic level, in due proportion. An extension of this method was used by Cattell (1934) in establishing adult norms for an intelligence test. His testees were classified according to their occupations, and the numbers at each main occupational level were compared with the corresponding numbers in the total population. Since he had an unduly high proportion of professionals, and too low a proportion of labourers, he reduced the weight of the former and increased that of the latter group, before calculating the mean and the S.D. of test scores for his combined groups.

The same difficulty in obtaining representative samples hinders us from tracing accurately the course of intellectual and educational growth from childhood, through adolescence, to adulthood. The available evidence, however, shows fairly definitely that this growth is not entirely regular, which means that M.A. and E.A. units are not equivalent throughout. The year by year increase of intelligence in an average person seems to be reasonably constant from about 3 to 10 years, after which the rate of increase diminishes, and the M.A. units become progressively smaller, until a constant level is reached somewhere in the neighbourhood of 15 years. Probably this upper limit is reached at different ages on different tests, and by different individuals. When the growth curves of particular children have been followed through, they have shown considerable irregularities (Dearborn and Rothney, 1941). Average adolescents may show no further improvement after about 12 on a simple performance test; but, provided their education continues, they may go on advancing on complex verbal tests even up to 18 or 20. The upper limit for a very dull child is, of course, far lower; he may never surpass the performance of an average 10-year-old on any test. It is not yet certain whether he reaches this limit at the same C.A. as an average child reaches his, or somewhat earlier. A very bright child, on the other hand, naturally improves well above the normal 15-year level, and so we express his intelligence as M.A. 16, 17, ... up to 22 or even higher. But these units are entirely artificial. A 20-year M.A. does not mean that a testee's performance is equal to that of an average person aged 20, but only that his performance is a good deal better than the performance of average persons aged anything from 15 years upwards. We must conclude, then, that M.A. units are so complex, and likely to be so uneven above about 12 years, that it would be far better to discard them altogether at such levels.

¹ Cf. Fraser Roberts et al. (1935).
² Cf. Scottish Council for Research in Education (1933).
Lack of Equivalence of M.A. and E.A. Units.—Still less is known about E.A. units. Possibly the increase is fairly steady from about 6 to 10 or 11 years, but then there is a strong tendency in most schools to force children unduly, in order that they may pass the secondary school selection examinations. Thereafter some relaxation occurs, at least on the formal subjects. Hence the 10- to 11-year unit is likely to be exceptionally large and the 12- to 13-year unit exceptionally small. The units may then tail off among those in modern schools, but among grammar school pupils and university students there is likely to be progress well into adulthood. Clearly then E.A. and M.A. units do not provide comparable scales much above 10 years. A child of 13 with an M.A. and an E.A. of 14 is not equally superior in intelligence and in education, since 14 minus 13 is a bigger quantity in the latter than in the former. Nor, of course, is this superiority of 1 year equivalent to the superiority of a child of 7 whose M.A. and E.A. are 8 years.

Not long after Binet put forward the M.A. system of scaling intelligence, the discovery was made that a child's relative advancement or retardation does not remain constant, but increases as his age increases. Burt (1917) proved that the same was true of E.A. A child of 5 with an M.A. or E.A. of 6 will usually show an M.A. or E.A. of 12, not of 11, when he reaches C.A. 10. In other words, a superiority of a year in a young child is much greater than the same superiority in an older child. But the ratio of M.A. to C.A., or E.A. to C.A., does seem to remain approximately constant; hence the I.Q. and E.Q. have been very widely adopted as measures of intellectual and educational advancement and retardation.
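The arithmetic behind the quotient is simple enough to show directly. The sketch below uses the conventional definition of the quotient as 100 times the ratio of test age to chronological age; the function names are hypothetical, and the constancy of the ratio over time is the approximation described in the text, not a guaranteed result.

```python
def quotient(test_age, chronological_age):
    """I.Q. (or E.Q.): test age as a percentage of chronological age."""
    return 100 * test_age / chronological_age

def projected_test_age(quotient_value, chronological_age):
    """Test age expected at a later age if the quotient stays constant."""
    return quotient_value * chronological_age / 100

iq_at_five = quotient(6, 5)                       # child of 5 with M.A. 6
ma_at_ten = projected_test_age(iq_at_five, 10)    # the same child at C.A. 10
print(f"I.Q. at 5: {iq_at_five:.0f}; expected M.A. at 10: {ma_at_ten:.0f}")
```

A one-year lead at 5 thus becomes a two-year lead at 10, which is why the ratio and not the difference is taken as the stable measure.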

Ambiguity in I.Q. Units.—There is yet another flaw both in M.A. and I.Q., and in the corresponding educational units, namely that the range and S.D. of scores expressed in these units are wider for some tests than for others. For example, the application of educational tests to a class of 9-year-olds might show that their average E.A. was 9, and that 20% of them had E.A.s of 10 or more. But the application of a group intelligence test to the same pupils might show an average M.A. of 9 and as many as 30% with M.A.s of 10 or more. If this was so, then clearly an advancement of 1 year, or an I.Q. or E.Q. of 111 (i.e. 10/9 × 100), does not mean the same thing on different tests. Some intelligence tests yield I.Q.s with a S.D. of 15, or even less; others yield I.Q.s with a S.D. as high as 25 or more. Probably educational tests show similar irregularities. The S.D. of the old Stanford-Binet test I.Q.s was established as about 15, and many modern tests such as the Moray House group tests and the Wechsler Intelligence Scales for Children and Adults have adopted the same figure as a convenient one. Terman-Merrill I.Q.s tend to be more widely dispersed, their average S.D. being quoted as 16½. But it varies from 12½ to 20 at different age levels. This means that I.Q.s (with the same mean of 100) have different meanings from one test to another, or on the same test at different ages. An ordinary school class will commonly obtain a range of I.Q.s from about +2σ to −2σ, i.e. from 130 to 70 on a test with a S.D. of 15. But the same class tested with a group test whose S.D. is 25 will range in I.Q. from 150 to 50. To take another example: a class of very bright children may have an average I.Q. of 115 on a Moray House test. The same class tested with another group test may have an average I.Q. of 125. The following table illustrates the proportions to be expected in the total population with I.Q.s 125 or more, or 75 or less, on tests with various S.D.s.


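S.D. of I.Q.s      Percentage with I.Q.s of 125 or more, or 75 or less
12½                2 (fraction illegible in the source)
15                 5
16½                (illegible in the source)
20                 (illegible in the source)
25                 16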
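The proportions in such a table can be approximated from the normal curve. The following is a minimal sketch, assuming that I.Q.s are normally distributed about a mean of 100; the function name `tail_percentage` is an illustrative choice and not part of the original text.

```python
from statistics import NormalDist

def tail_percentage(sd, distance=25.0):
    """Percentage of a normal population lying `distance` points or more
    above the mean (by symmetry the same percentage lies that far below).
    Assumes I.Q.s are normally distributed with the given S.D."""
    z = distance / sd
    return (1.0 - NormalDist().cdf(z)) * 100.0

# S.D. values quoted in the text for various tests.
for sd in (12.5, 15, 16.5, 20, 25):
    pct = tail_percentage(sd)
    print(f"S.D. {sd:>4}: about {pct:.1f}% at 125 or more, and the same at 75 or less")
```

The figures for S.D.s of 15 and 25 come out close to the 5 and 16 per cent. that can still be read in the table.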


Unfortunately it is impossible to assert that any particular value
of the S.D. of I.Q.s is the 'right' one, and that the other values are
too large or too small. The value depends on certain features of the
test which are still somewhat obscure,¹ and are too complex for
discussion here. The figure 15 is implied when we talk of 70 as the
border-line for mental deficiency, since most of the testing of
border-line cases has been carried out with the Stanford-Binet test whose
S.D. kept fairly close to 15. The only satisfactory solution to this
problem is to give up the Mental and Educational Age system of
calculating I.Q.s and E.Q.s, and to establish separate percentile or
standard-score norms at each age level for which the test is suitable.
The conventional form of I.Q. has been, and is still being, used far
too loosely, as though it represented a standard scale of measurement,
independent of the particular test employed. Actually, like
any other test score, its interpretation is possible only if its S.D. (as
well as the mean of 100) is taken into account. The derivation of
standard scores may be somewhat more difficult for untrained
testers to grasp. But such scores do have a uniform significance with
different tests at different ages, and, when the S.D. of 15 is adopted,
they do provide a scale which is very closely comparable to the
older, generally accepted, scale of I.Q.s.

1 The main factor seems to be the degree of heterogeneity among the items or
subtests. The more homogeneous the test material, the higher will be the S.D.
of I.Q.s. Hence tests composed mainly of verbal reasoning items like Cattell's
Scales I and III tend to have higher S.D.s than more miscellaneous tests like
Stanford-Binet. But in addition certain types of items seem to be intrinsically
more 'spreading' than others.
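The conversion to standard scores recommended here can be sketched in a few lines. This is only an illustration, assuming that the mean and S.D. of raw scores are known for the child's own age level; the norms used below (mean 42, S.D. 8) are hypothetical figures, not taken from any actual test.

```python
def standard_score(raw, age_mean, age_sd, scale_mean=100.0, scale_sd=15.0):
    """Express a raw score as a standard score on a scale with mean 100
    and S.D. 15, using the norms for the child's own age level."""
    z = (raw - age_mean) / age_sd      # deviation in S.D. units
    return scale_mean + scale_sd * z

# Hypothetical age-level norms: mean raw score 42, S.D. 8.
print(standard_score(50, age_mean=42, age_sd=8))   # one S.D. above the age mean
print(standard_score(34, age_mean=42, age_sd=8))   # one S.D. below the age mean
```

Because every test and every age level is referred to the same mean and S.D., a given standard score carries the same meaning whichever test supplied it.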
RELIABILITY AND THE THEORY OF SAMPLING


Untrustworthiness of Results obtained from Small Numbers of Persons.--
This chapter is likely, we must admit, to cause considerable difficulty to
readers unfamiliar with statistical concepts. It introduces a point of view
with regard to educational measurements and experiments which may
at first seem very complex and involved, yet which is absolutely essential
to a proper understanding of scientific education and psychology.¹
The scientific investigator naturally wishes to arrive at results
which are true for people in general. He wants to prove that a certain
method of teaching gives superior results with all children, that the
correlation² between intelligence tests or school examinations and
subsequent scholastic achievement always reaches a certain figure,
or he wants to know the average test score of all persons brought up
in a particular environment, e.g. of rural as contrasted with urban
children, or Scottish as contrasted with English, and so on. But he
can seldom, if ever, measure all the persons to whom his conclusions
are supposed to apply. He is forced to work with samples of the
population, and he is therefore entitled to regard his result (whether
it be an average, or a difference between two averages, or a correlation
coefficient, etc.) only as a sample result, which might alter to some
extent if he repeated the work on other samples of the population.
He must, of course, take precautions in the first place to see that
his sample is not what we call a biased one. Supposing his problem
is the securing of norms for a group intelligence test, and he wishes
to find the average and the dispersion of scores among children aged
11 +, he must, as we saw in the previous chapter, avoid choosing
only schools in poor areas or only private schools. But even when
such bias in selection is eliminated, he would still have to content
himself with a limited sample of the children aged 11 + in the
country as a whole, and the average of this sample might differ to
some extent from the true average. Similarly if his problem was the
comparison of the intelligence of children in suburban and industrial
areas, he could not measure all the schools of these types. Thus the
difference which he obtained would only be an approximation to the
difference that might result from much more complete investigations.
Here are two concrete instances.

1 A particularly clear explanation has been given by Thouless (1939b).
2 The reader who is unfamiliar with the conception of correlation may be
advised to study pp. 92-5 of the next chapter before continuing with this
chapter.
Ex. 27.--Twelve of the writer's training college students were
given an American objective test of reading ability, namely the
Nelson-Denny test, which measures knowledge of word meanings
and comprehension of paragraphs. Their scores, arranged in order,
were: 140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82.
Now most of these scores are well above the average for American
college students, namely 96. Their average, 113.6, falls at the 78th
American percentile. But obviously the scores are very variable.
The question is, then, are we or are we not justified in deducing from
them that Scottish training college students possess higher reading
ability than American college students? Obviously no such deduction
could be made from the score of only a single student, but does
the average of a sample of 12 provide a much surer basis? The
12 were picked at random from among the writer's 264 students.
Thus he was careful to avoid selecting only those, or a preponderance
of those, who had an Honours degree in English. And men
and women were equally represented, since it was conceivable that a
sample composed only of women or only of men would be biased.
Yet it is quite possible that he unintentionally picked an undue
number of clever students, and that if he picked another dozen,
their scores might be mostly below 100, and their average no higher,
or even lower, than the American figure. How liable then is such
an average of twelve measures to vary?

The question can be answered in this instance, since the same test
was given to 21 other sets of 12 students, also chosen at random.
The 22 averages range from 103.6 to 126.8, their distribution being
shown in the following table. As we suspected, the averages are
rather variable, though less so than the separate scores, which range
all the way from 62 to 160. And though our first average of 113.6
happens to be fairly close to the grand average of 114.9, yet we might,
in the luck of the draw, have obtained an average as much as 10
points above or below this figure.

Average Mark          Frequency
(Distribution of the 22 sample averages; the entries are illegible in the source.)

Ex. 28.--The correlation coefficient was calculated between the
marks of 25 students on an examination given near the beginning of
the year, and their marks on another paper at the end of the year.
The coefficient was + 0.53. But 25 students is only a very small
sample, and similar results calculated for 13 other similar samples
varied widely, as shown in the following table.

Correlation Coefficients        Frequency
+ 0.60 to + 0.69
+ 0.50 to + 0.59
+ 0.40 to + 0.49
+ 0.30 to + 0.39
+ 0.20 to + 0.29
+ 0.10 to + 0.19
  0.00 to + 0.09
- 0.10 to - 0.01
Total                           14
(The individual frequencies are illegible in the source.)

Thus the original figure is far from trustworthy, and even the
average coefficient of + 0.315, for 350 students, though probably
much more reliable, might also alter to some extent if the same
examinations were compared among other similar large groups of
students.

The Scientific Standpoint.--The scientific investigator, therefore,
always looks on any numerical result as a sample one which may
deviate more or less from the truth. Even when he has excluded errors
of selection such as would make the persons he is measuring in any
way unrepresentative, he has to consider the probable extent of these
so-called chance errors of sampling. He therefore goes on to assume
that it is theoretically, though perhaps not practically, possible to
obtain a very large number of other such samples. The two tables
given above are instances on a small scale. They both bring out a
vital step in our argument, namely that such additional samples tend
to conform to normal distributions. The numbers involved are small,
hence these distributions are irregular, but there exists ample
evidence to show that larger numbers would eventually yield smooth
normal curves.

Let us take a fresh instance, where only a single sample result is
available, in order to illustrate the further stages of the argument.

Ex. 29.--The average scores for 136 men and 114 women students
on the Cattell Intelligence Test III (cf. Ex. 3, p. 10) were 101.55 and
99.65 respectively. Thus the men were 1.90 points superior.


Can we deduce that men students, of the type tested, are in general
better than women, or would we, if we applied the same tests to
many other similar groups, find the difference to be larger in some
cases, negligibly small in other cases, or even reversed in others?
This difference, which we shall call d, is only a sample one, which
should be regarded as falling at some point on the distribution of a
large number of possible alternative d's. Clearly the best possible
estimate of the true difference between men and women would be a
grand average of all these hypothetical d's. This quantity, the mean
of the distribution of d's, we shall call D. The particular d which we
know is a more or less accurate approximation to D.

A formula has been worked out by means of which it is possible
to predict, from the actual data available, what would be σ_d, that is
the S.D. of this distribution of d's. The formula will be given later.
In the present instance it yields σ_d = 1.41. As the distribution is a
normal one, this figure will enable us to define its limits. We can state
that about one-third of the d's must fall between D and D + 1.41,
another third between D and D - 1.41, and that about one-third of
them must deviate more widely from D. Scarcely any of them, or
only 0.135%, will be as large as D + 4.23 (i.e. 3σ_d), and only 0.135%
will be as small as D - 4.23. Hence 99.73% (i.e. 100 - 2 × 0.135)
of the d's lie within the limits D ± 4.23, and the odds against any
d lying outside these limits are 99.73 to 0.27, or 370 to 1. As a general
rule, the probability is 370 to 1 that any d which deviates from D
purely on account of chance errors of sampling will not do so by
more than 3σ_d.

Certain other probabilities have been deduced from the probability
integral graph (Fig. 18), and are listed in the following table:

Limits           Frequencies of d's       Odds against d's       P
                 falling outside these    falling outside        value
                 limits                   these limits
± 0.6745 σ_d     2 × 25.0%                1 to 1                 .50
± 1.00 σ_d       2 × 15.9%                Nearly 2 to 1          .32
± 1.645 σ_d      2 × 5.0%                 10 to 1                .10
± 1.96 σ_d       2 × 2.5%                 20 to 1                .05
± 2.58 σ_d       2 × 0.5%                 100 to 1               .01
± 3.29 σ_d       2 × 0.05%                1,000 to 1             .001
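Readers without access to a probability integral graph can obtain the same figures from the normal distribution directly. The sketch below is illustrative only; the function names are not from the text, and the calculation assumes, as the argument does, that the sample d's are normally distributed about D.

```python
from statistics import NormalDist

def two_tailed_p(k):
    """Proportion of a normal distribution lying more than k S.E.s
    from the mean in either direction (the P value of the table)."""
    return 2.0 * (1.0 - NormalDist().cdf(k))

def odds_against(p):
    """Odds against an event whose probability is p, expressed as 'x to 1'."""
    return (1.0 - p) / p

for k in (0.6745, 1.00, 1.645, 1.96, 2.58, 3.00, 3.29):
    p = two_tailed_p(k)
    print(f"± {k:<6} S.E.   P = {p:.4f}   odds against about {odds_against(p):,.0f} to 1")
```

The exact odds for P values of .10, .05 and .01 are 9, 19 and 99 to 1; the text quotes them in round figures as 10, 20 and 100 to 1.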


Now these predictions must apply to the particular d which we
happen to know, whose value is 1.90. The odds are very much against
its deviating as much as 3.29σ_d from D, but it might quite possibly
deviate as much as, say, 1.96σ_d since the odds against this are much
lower. We can therefore, as it were, reverse the argument, and state
with practical certainty that D must lie between 1.90 ± 3.29σ_d, i.e.
between + 6.54 and - 2.74. We do not know, of course, what is the
value of D, but if it fell outside these limits, our d would be more
than 3.29σ_d removed from it, and the odds against this are 1,000 to 1.
Similarly the odds are 100 to 1 that it lies between 1.90 ± 2.58σ_d,
= + 5.54 and - 1.74; and 20 to 1 that it lies between 1.90 ± 1.96σ_d,
= + 4.66 and - 0.86. These latter limits are commonly referred to
as the .01 and .05 confidence limits, respectively, of our obtained d.
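The limits just quoted follow from simple arithmetic on d = 1.90 and σ_d = 1.41. A minimal sketch (the function name `confidence_limits` is an illustrative choice):

```python
def confidence_limits(d, se, k):
    """Lower and upper limits d - k*S.E. and d + k*S.E."""
    margin = k * se
    return d - margin, d + margin

d, se = 1.90, 1.41   # obtained difference and its S.E. from Ex. 29
for label, k in (("1,000 to 1", 3.29), ("100 to 1 (.01)", 2.58), ("20 to 1 (.05)", 1.96)):
    low, high = confidence_limits(d, se, k)
    print(f"odds {label:<15} D between {low:+.2f} and {high:+.2f}")
```

Since every pair of limits includes zero, none of them excludes the possibility that the true difference D is nil.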
The Standard Error as an Index of the Trustworthiness of a Result.--
It may be seen then that σ_d provides us with an index of the margin
of uncertainty in our d. It tells us that the difference of 1.90 in favour
of men students is by no means trustworthy, since the true value of
D may be considerably larger or smaller.
This is the manner in which the scientific investigator always thinks
of the results of psychological or educational experiments. His
obtained result, being only a sample based on a limited number of cases,
is likely to deviate to a certain extent from the true result which he
would obtain if he could experiment on a very much larger set of
people. He can, however, work out the σ_d, often referred to as the
Standard Error or S.E. of his result, which tells him, not precisely
how much his result is in error, but what is the probability that his
result is in error by various amounts. There is a 2 to 1 chance that the
error may be no bigger than the S.E., and the odds against the
occurrence of a larger error rise very rapidly, so that it is extremely unlikely
to be greater than three times the S.E.
The Probable Error.--The figures at the head of the table, above,
show that there is an even chance of D lying within, or lying outside,
the limits ± 0.6745σ_d. This particular fraction of the S.E. is known
as the Probable Error or P.E. The name is not a very good one, but
in a sense it is the most probable error, since the odds against either
larger or smaller errors than this are higher than 1 : 1. In our example,
the P.E. = 0.6745 × 1.41 = 0.95. This means that D is just as likely
to lie between 1.90 + 0.95 and 1.90 - 0.95 as it is to lie outside these
limits.
The P.E. is frequently used as an index of the unreliability or
margin of uncertainty, in preference to the S.E. Instead of saying
that the odds against D deviating from d by 3.29 × S.E. are 1,000
to 1, we can say that a deviation of 5 × P.E. possesses this degree
of unlikelihood, since 5 × 0.6745 is approximately equal to 3.29.
Or if ± 1.96 × S.E. are regarded as the limits within which D may
reasonably be expected to lie, we can alternatively denote these limits
as ± 3 × P.E. Often a numerical result such as our d of 1.90 is
stated in the form: 1.90 ± 0.95, where the latter figure is the P.E. of
the former. While both the S.E. and the P.E. indicate the likely
amount of error in some numerical result, the S.E. is, as we have
seen, primarily a measure of the dispersion of all the possible sample
results. The same is true of the P.E. In the hypothetical distribution
of sample d's, the limits D + P.E. to D - P.E. enclose 50% of these
d's, and 25% of them are greater than D + P.E., 25% less than D -
P.E. Note therefore that the P.E. is identical with Q, the semi-interquartile
range, of this distribution. For the limits Q1 to Q3 also mark
off the middle 50% of measures in a normal distribution.

The Statistical Significance of a Difference.--Many experiments
have shown that the scores of comparable groups of men and women
on intelligence tests are practically identical, or that any differences
found are not reliable. Since, in our example, the odds are only 21
to 1 against D being greater than 4.72 or less than - 0.92, it seems
quite possible that here too the d is not reliable, and that the true
value of D is zero. If this be so, then it is easy to determine from the
probability integral graph the chances that a sample might show a d
of + 1.90, merely through errors of sampling. 1.90 is 1.35 × 1.41,
or 1.35σ_d. Consulting the graph (Fig. 18), we see that 9% of measures
in a normal distribution lie at or beyond 1.35σ from the mean. Hence
when D is zero, 9% of the sample d's amount to + 1.90 or more,
and 41% lie between 0 and + 1.90. Thus the odds against a d of
1.90 being due to errors of sampling are only 41 to 9, or about 4½ to
1.¹ In other words, it might occur as a matter of chance once in four
or five testings of similar samples of students. Such odds are far too
low for us to put any trust in the conclusion that men students are
better than women at this test. Had d been 3.63 (i.e. 2.58σ_d) we might
have felt reasonably certain, since this difference could arise by
chance only once in a hundred times. It would then be referred to as
statistically significant at the .01 level. Note that 'significant' does
not necessarily imply meaningful. We are not concerned here with
the psychological interpretation of any difference. Statistically
significant refers purely to the reliability or trustworthiness of the figures.

1 The acute reader may well ask here: why not contrast the 9% of sample d's
of + 1.90 and over with the 91% of smaller and of negative d's, thus arriving
at odds of about 10 to 1? This procedure is denoted as the one-tailed test, since
it is based on the top end or tail of the normal distribution of d's; whereas the
test described in the text is the two-tailed. Many statisticians hold that the
one-tailed test should be used when it is desired to show that a difference is as large
or larger than a certain amount, whereas the two-tailed test is appropriate when
attempting to prove that the difference lies outside certain limits in either
direction, positive or negative. Others, however, prefer to use the two-tailed
test under almost all circumstances, since it provides a more conservative
estimate of the trustworthiness of their results. By following this practice we are,
in effect, implying that the true sex difference might be in favour of men or
women, rather than that the true difference might be in favour of men or else
be zero.
A difference between the results of two groups, divided by its S.E.,
is often called the Critical Ratio or C.R. In this example the C.R. is
only 1.35. It is customary to place little reliance in a difference whose
C.R. is less than 2 or preferably 3. Alternatively a d is regarded as
significantly greater than zero if it exceeds 3, or preferably 5, times
its P.E. But the border-lines are quite arbitrary matters of convention;
we are certainly not justified in rejecting all smaller differences
because they are conventionally non-significant. Suppose the C.R.
was 1.645 (d = 1.19), with a probability of 1 in 10. Though this
might easily arise by chance, it does not prove that the true value of
D is zero. Indeed, according to the .01 confidence limits, the true
value might lie anywhere between + 4.82 and - 2.44. Such a critical
ratio is said to be of borderline significance, since it suggests that a
true positive difference might be slightly more likely to occur than a
zero or a negative one if we were able to investigate further samples.
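The critical ratio and the chance probability attached to it can be computed as follows. This is a sketch under the same assumption of a normal sampling distribution; the function names are illustrative, and the `two_tailed` switch simply selects between the two-tailed and one-tailed tests.

```python
from statistics import NormalDist

def critical_ratio(d, se):
    """A difference divided by its standard error."""
    return d / se

def chance_probability(cr, two_tailed=True):
    """Probability that a critical ratio at least this large arises purely
    from errors of sampling when the true difference is zero."""
    tail = 1.0 - NormalDist().cdf(abs(cr))
    return 2.0 * tail if two_tailed else tail

cr = critical_ratio(1.90, 1.41)          # the men-women difference of Ex. 29
print(f"C.R. = {cr:.2f}")
print(f"two-tailed P = {chance_probability(cr):.3f}")
print(f"one-tailed P = {chance_probability(cr, two_tailed=False):.3f}")
```

The one-tailed figure corresponds to the 9% read from the graph; the two-tailed figure, roughly double, corresponds to a chance occurrence about once in five or six testings.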
Another note of caution is necessary. Even if d had been large
enough in the above example (relative to its S.E. or P.E.) to be
accepted as significant, the conclusion that men are somewhat
superior at the test would have applied only to the type of students
tested, namely Scottish university graduates, who had chosen the
teaching career. It could not be extended to English training college
students, nor to other university students, far less to men and women
in general, since the persons actually tested could not be regarded as
typical or representative of these wider groups. Again it would be
safer not to claim that such a difference was a difference in
intelligence, but only in the ability which is measured by this particular
group intelligence test. The statistical treatment which has been
applied to the distributions of scores only informs us how sound and
reliable are the results in respect of other similar testees, i.e. it takes
care of sampling errors, but it does not eliminate errors or bias in
selection. As mentioned in Chap. I, statistical methods cannot
improve data which are in any way inaccurate or defective; they merely
bring out the significance (or lack of significance) of the data actually
obtained.

FORMULÆ FOR THE STANDARD AND PROBABLE ERRORS OF
NUMERICAL RESULTS


Common sense tells us that a numerical result such as an average,
or a correlation, or a difference between two averages, will always be
more reliable the greater the number of cases on which it is based.
In Ex. 27 (p. 75) it was shown that the averages obtained by sets of
twelve students on a test of reading ability were exceedingly variable.
Obviously twelve is far too small a sample upon which to base
conclusions. In order to make a trustworthy comparison between the
performance of Scottish and American students at this test, we
should need much larger numbers. Most of the formulæ for
reliability involve the quantity 1/√N, which means that by quadrupling
the size of a sample, we can halve the variability of any result
computed from this sample, or double its reliability. Another important
factor which governs the reliability of a result is the variability among
the original measures. Suppose that, in Ex. 29, all the men students
had scored 101 or 102 (average 101.55), and all the women had
scored 99 or 100 (average 99.65), we should certainly have been
entitled to claim that men students do better at the test than women.
But it is because the men's scores range all the way from 73 to 134,
and the women from 78 to 129, that we felt suspicious of the difference
in averages, and applied statistical treatment to it. There was so much
overlapping that about 42% of the women obtained higher scores
than the average man, and about 42% of the men scored lower than
the average woman. Hence the reliability formulæ also involve some
measure of the dispersion or extent of variation of the original
measures.

The S.E. of a Difference.--This S.E. is given by the formula:

$$\sigma_d = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}$$

Here σ_1 and σ_2 are the S.D.s of the scores in the two groups, N_1 and
N_2 are their numbers. The P.E. of a difference is 0.6745 times this
quantity.
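As a check on the figure of 1.41 used for Ex. 29, the S.E. of a difference can be computed from the group sizes given there (136 men, 114 women) and the S.D.s of the two groups, which are quoted in Ex. 33 as 12.36 and 9.95. The function name is an illustrative choice.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """S.E. of the difference between the means of two independent groups."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

se = se_difference(12.36, 136, 9.95, 114)   # men, women
pe = 0.6745 * se                            # probable error of the difference
print(f"S.E. of difference = {se:.2f}, P.E. = {pe:.2f}")
```

The result agrees with the values of 1.41 and 0.95 quoted earlier in the chapter.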

Ex. 30.--As a fresh illustration, let us compare the scores on the
Nelson-Denny reading test of students having Honours degrees in
Arts with the scores of students having Honours in Science or
ordinary degrees. The essential data are as follows:

                            N        M         σ
Honours Arts students       66       129.47    16.42
Other students              198      109.97    18.90

The difference between the two averages is 19.50.

$$\sigma_d = \sqrt{\frac{16.42^2}{66} + \frac{18.90^2}{198}} = \sqrt{4.085 + 1.804} = 2.427$$

Here the critical ratio is 19.50 / 2.427, or approximately 8. Such a difference
could not possibly occur through chance errors of sampling. Hence
we may conclude that Honours Arts students, of the type tested, are
definitely superior to other graduate students at this test.
S.E. of a Difference between Averages of Scores which are
Intercorrelated.--When one group of persons takes two tests, or if they
repeat a test, and we wish to examine the statistical significance of
the difference between their two average scores, we must use a
modified formula.¹ This applies whenever there exists a correlation,
r, between the two sets of scores.

$$\sigma_d = \sqrt{\frac{1}{N}\left(\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2\right)}$$


Ex. 31.--Marks in English and arithmetic were awarded to a class
of 57 children. These are tabulated in Ex. 23, p. 57. Apparently the
arithmetic marks run higher than the English ones. Is the difference
significant? The figures are as follows:

                Mean       Difference     σ         N
English         67.615                    12.94
Arithmetic      70.510     2.895          14.64     57

The correlation between the two sets of marks is indicated by a
coefficient of + 0.542.

$$\sigma_d = \sqrt{\frac{1}{57}\left(12.94^2 + 14.64^2 - 2 \times 0.542 \times 12.94 \times 14.64\right)} = 1.761$$

The difference of 2.895 is only 1.64 times its S.E.; hence it is not
significant.
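The modified formula for correlated scores can be sketched as below, using the figures of Ex. 31. The function name is illustrative, and the result may differ from the printed value in the third decimal place through rounding.

```python
import math

def se_difference_correlated(sd1, sd2, r, n):
    """S.E. of the difference between two means obtained from the same
    n persons, whose two sets of scores correlate to the extent r."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2 - 2 * r * sd1 * sd2) / n)

se = se_difference_correlated(12.94, 14.64, 0.542, 57)   # English, arithmetic
cr = 2.895 / se                                          # critical ratio
print(f"S.E. = {se:.3f}, C.R. = {cr:.2f}")
```

Had the correlation been ignored (r = 0), the S.E. would have come out appreciably larger, and the difference would have appeared even less significant than it is.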
1 An alternative version of the formula, which obviates calculating the
correlation coefficient, is given below in the section on small samples (p. 85).

S.E. of a Mean.--An average is, as we have seen in Ex. 27, quite
as likely to vary through errors of sampling as a difference between
averages. Hence it is very useful to know how near to the true
average of a very large group is likely to be the average obtained
from a group of limited size. The S.E. of the mean is often
symbolized by σ_M, and is given by the formula:

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

Here σ is, of course, the S.D. of the original scores, whose average has been
computed.

Ex. 32.--The twelve scores on the Nelson-Denny test, listed in
Ex. 27, have a mean of 113.58 and S.D. of 20.07.

$$\sigma_M = \frac{20.07}{\sqrt{12}} = 5.794$$

The P.E. of M is 0.6745 × S.E. = 3.908.

Thus the true mean for students of the type tested may just as well
lie outside the limits 113.58 ± 3.91 (i.e. 117.49 to 109.67) as within
them. It may be as much as 2.58σ_M away from 113.58, i.e. as high as
128.53 or as low as 98.63, but it is unlikely to exceed these .01
confidence limits. Obviously the figure is extremely unreliable.

By contrast the mean for 264 students was 114.9 and the S.D.
20.15. Here

$$\sigma_M = \frac{20.15}{\sqrt{264}} = 1.24$$

The sample is 22 times as large, hence σ_M is 1/√22 of its previous
value. The P.E. of M = 0.836. From these
figures it follows that there is an even chance of the true mean lying
between 115.7 and 114.1, and that it almost certainly does not lie
outside the limits 118.1 and 111.7.
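The figures for the twelve scores of Ex. 27 can be reproduced directly. The sketch below takes the S.D. with N (not N - 1) in the denominator, which is the form that yields the quoted value of 20.07; the function name is illustrative.

```python
import math
from statistics import fmean, pstdev

# The twelve Nelson-Denny scores listed in Ex. 27.
scores = [140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82]

def se_mean(values):
    """S.E. of a mean: the S.D. of the scores divided by the square root of N."""
    return pstdev(values) / math.sqrt(len(values))

mean = fmean(scores)
se = se_mean(scores)
pe = 0.6745 * se
print(f"mean = {mean:.2f}, S.D. = {pstdev(scores):.2f}")
print(f"S.E. of mean = {se:.3f}, P.E. = {pe:.3f}")
print(f".01 limits: {mean - 2.58 * se:.2f} to {mean + 2.58 * se:.2f}")
```

Applying the same function to a sample 22 times as large, with a similar S.D., would give a S.E. of roughly one-fifth this size, as the second half of the example shows.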




S.E. of a Standard Deviation and of a Difference between S.D.s.--
Not only the mean but also the S.D. of a set of measures may be
affected by errors of sampling. The formula for the S.E. of a S.D. is:

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}}$$

The S.E. of a difference between two S.D.s is given by:

$$\sqrt{\frac{\sigma_1^2}{2N_1} + \frac{\sigma_2^2}{2N_2}}$$
Ex. 33.--The S.D.s of the intelligence test scores of men and
women students were 12.36 and 9.95 respectively, indicating that
the dispersion of ability was greater among the men. The S.E. of the
difference of 2.41 between these two figures is:

$$\sqrt{\frac{12.36^2}{2 \times 136} + \frac{9.95^2}{2 \times 114}} = 0.998$$

The difference is 2.42 times its S.E., and this shows a fair degree of
statistical significance. The odds against such a difference occurring
through chance errors are 49.23 to 0.77, or 64 to 1. The result
therefore allows a reasonably strong presumption that men students
of the type tested are in general more spread out in their ability at
an intelligence test than are women students.
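The two formulæ for standard deviations can be sketched as follows, using the figures of Ex. 33 with the group sizes of 136 men and 114 women given in Ex. 29. The function names are illustrative choices.

```python
import math

def se_sd(sd, n):
    """S.E. of a standard deviation based on n cases."""
    return sd / math.sqrt(2 * n)

def se_sd_difference(sd1, n1, sd2, n2):
    """S.E. of the difference between the S.D.s of two independent groups."""
    return math.sqrt(sd1 ** 2 / (2 * n1) + sd2 ** 2 / (2 * n2))

se = se_sd_difference(12.36, 136, 9.95, 114)
ratio = (12.36 - 9.95) / se
print(f"S.E. of difference between S.D.s = {se:.3f}, ratio = {ratio:.2f}")
```

Note that the S.E. of a difference between S.D.s is simply the square root of the sum of the squared S.E.s of the two S.D.s taken separately.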


S.E. of a Percentage and of a Difference between Percentages.--
Often it is not possible to measure the amounts of some quality in a
group of persons, but only to state what percentage of the persons
possess or do not possess it. Suppose that in several schools 60% of
the boys and 65% of the girls regularly take milk, does this indicate a
reliable sex difference? Again the trustworthiness of the result depends
on the number of cases. If p is the percentage, and q = 100 - p,
then the S.E. of p is

$$\sqrt{\frac{pq}{N}}$$

and the S.E. of a difference between two percentages is:

$$\sqrt{\frac{p_1 q_1}{N_1} + \frac{p_2 q_2}{N_2}}$$


Ex. 34.--From how many children would it be necessary to obtain
milk-drinking records in order to be practically certain that the
difference of 65% - 60% is reliable?

This example is posed in the reverse form from Exs. 30-33, but the
same method is employed. If a difference of 5% is reliable, it must
be about 2½ times its S.E. That is, the S.E. must not be greater than
2. Let the number of both boys and girls in the schools where records
are obtained be N. Then:

$$2 = \sqrt{\frac{60 \times 40}{N} + \frac{65 \times 35}{N}}$$

The answer is, then, that the sex difference of 5% would be
significant if based on some 1,200-odd boys and an equal number of girls,
but that it would not be so if the numbers were much smaller.
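The percentage formulæ can be tried out on groups of any size. In the sketch below the function names are illustrative, and the group size of 200 boys and 200 girls is a hypothetical figure chosen only to show how unreliable a 5% difference is with moderate numbers.

```python
import math

def se_percentage(p, n):
    """S.E. of a percentage p (0 to 100) based on n cases."""
    return math.sqrt(p * (100 - p) / n)

def se_percentage_difference(p1, n1, p2, n2):
    """S.E. of the difference between two independent percentages."""
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

# Hypothetical case: 60% of 200 boys and 65% of 200 girls take milk.
se = se_percentage_difference(60, 200, 65, 200)
print(f"S.E. of the 5% difference = {se:.2f}, C.R. = {5 / se:.2f}")
```

With these numbers the difference is only about equal to its own S.E., so it could very easily arise through errors of sampling.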


One of the commonest faults in writings about educational topics
is the quotation of percentage results without any indication of the
reliability of these results, often without even any mention of the
actual numbers of cases from which the P.E. could be calculated.

It should be noted that all the formulæ quoted above for the S.E.
of a difference (between means, between S.D.s, and between
percentages) are derived from one and the same formula:

$$\sigma_d = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}$$

where σ_1 and σ_2 are the S.E.s of the means, S.D.s, and percentages
respectively (not the S.D.s of the original scores). The third term,
2rσ_1σ_2, applies whenever there is any correlation between the original
measures from which the means, S.D.s, or percentages are derived.
But when, as in most of our examples above, there is no such
correlation, the term vanishes.

Statistical Treatment of Small Samples.--It has been shown by
Fisher (1930) and others that numerical results obtained from very
small samples are even less reliable than the statistical treatment
described above would lead us to expect. Under such circumstances
it is better to calculate a S.E. by substituting N - 1 for N in the
denominator.

The resulting critical ratios are denoted by the symbol t, and
special tables are available which show the probability of various
values of t for different numbers of cases.¹ For a difference between
the means of two groups, the proper formula is:

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$

Σx_1² and Σx_2² are the sums of squared deviations from the respective
means. If calculated from raw scores, or from arbitrary means,
the corrections described on pp. 46-7 must be inserted.

For a difference between two sets of scores in the same group, it
is simplest to tabulate the actual differences, d, between each person's
two scores (allowing for signs). Then:

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma d^2 - (\Sigma d)^2 / N}{N(N-1)}}}$$

This automatically allows for the correlation between the two
variables.

Note that this modified treatment, though statistically preferable,
makes very little difference until the N's fall below about 15. The
enormous majority of educational experiments are carried out on
much larger numbers.
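The small-sample treatment can be sketched as follows. The two functions assume the usual pooled-variance form of t for independent groups and the difference-score form for paired measures; the function names and the short lists of scores are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math
from statistics import fmean

def t_independent(a, b):
    """t for the difference between the means of two separate groups,
    pooling the sums of squared deviations (N1 + N2 - 2 degrees of freedom)."""
    m1, m2 = fmean(a), fmean(b)
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled = (ss1 + ss2) / (len(a) + len(b) - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / len(a) + 1 / len(b)))

def t_paired(a, b):
    """t for the difference between two sets of scores from the same persons,
    worked from each person's difference d (N - 1 degrees of freedom)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    ss_d = sum(v * v for v in d) - sum(d) ** 2 / n
    return fmean(d) / math.sqrt(ss_d / (n * (n - 1)))

first = [1, 2, 3, 4, 5]      # hypothetical scores
second = [2, 4, 6, 8, 10]    # hypothetical scores
print(f"independent groups: t = {t_independent(first, second):.3f}")
print(f"same persons twice: t = {t_paired(first, second):.3f}")
```

The same figures yield a much larger t when treated as paired, because the two sets of scores are highly correlated and the paired form allows for this.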
1 Cf. Fisher and Yates (1938). Instead of the actual numbers of cases, the
tables refer to what are called degrees of freedom, which are normally N - 1.
Thus for a difference between 13 boys and 10 girls, the degrees of freedom are
12 + 9 = 21.

Analysis of Variance.--An extension of the t-technique enables us
to explore the reliability of differences between several groups
simultaneously, instead of dealing with only two at a time. For example,
we can test the significance of differences in means or percentages
between sets of pupils taught by several alternative methods, or
classes in several schools, or children classified according to several
grades of parental occupation. Again the groups may be classified
in two or more directions simultaneously, say by teaching method,
type of school, sex and intelligence level. The relative importance of
these influences on average test scores, and their statistical
significance, can thus be assessed. Note that this involves an important
departure from the model of the classical scientific experiment,
where the effects of altering one variable at a time were observed, all
other variables being kept constant. Hence analysis of variance,
which was first developed by Fisher for studying the effects of
different fertilizers or other conditions in agriculture, has become
one of the most powerful and most widely used tools of educational
experimentation. It has also proved of particular value in analysing
examination marks: how far these vary with the examiner, or with
the question or topic set, or with the school from which the
candidates come, and so on.

The detailed techniques are somewhat elaborate for description
here (cf. Lindquist, 1940), but the following simple example will
give some indication of their nature.

Ex. 35.—The writer's men students were divided, according to
their various courses of teacher training, into seven sections, each
numbering about 22 students. The average level of ability in different
sections seemed to differ considerably, and their average percentage
marks on examinations in psychology are listed in the second
column of the table below. The third column shows, however, that
the marks in any one section were very variable, usually ranging
from 45% to 85%.

Section    Mean Psychology Mark    S.D. of Marks
   A              72.89                 8.25
   B              58.69                 8.43
   C              65.59                 7.88
   D              64.41                 7.05
   E              63.95                 8.58
   F              63.52                10.71
   G              63.48                10.71

Thus in order to answer the
question whether there exists any relation between a student's ability
and the section to which he is assigned, we must determine whether
these differences between the means of sections are significant or
whether, in view of the variability of the marks within each section,
such differences might arise through chance fluctuations of sampling.

"Variance" is simply another term for spread, dispersion, or variability
(actually the variance of a distribution of measures is the
square of its S.D.). Analysis of variance enables us to say how far
these differences between sections can be ascribed to the effect of
individual differences within sections. The result is expressed in terms
of the F-ratio, in much the same way as the significance of differences
between two groups is expressed as t. (Indeed an F-ratio when only
two groups are compared is identical with t².) In this instance, the
F-ratio is 2.93, and appropriate tables tell us that the probability of
such differences between seven means arising by chance is quite
small, approximately 0.01. We may conclude that there is a definite
tendency for better students to be put into some sections, poorer
students into others.
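
The remark that F is identical with t² when only two groups are compared can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the ordinary one-way form of the F-ratio (variance between group means divided by variance within groups), the function names are hypothetical, and the scores are invented.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def f_ratio(groups):
    """One-way analysis of variance: between-groups variance over within-groups variance."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (between / (len(groups) - 1)) / (within / (n_total - len(groups)))

def t_two_groups(a, b):
    """Pooled-variance t for two separate groups."""
    within = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    pooled = within / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / len(a) + 1 / len(b)))

# Hypothetical scores for two small groups.
group_a = [12, 15, 11, 14, 13]
group_b = [10, 9, 12, 8]
F = f_ratio([group_a, group_b])
t = t_two_groups(group_a, group_b)
print(round(F, 3), round(t * t, 3))
```

With more than two groups the same `f_ratio` function applies unchanged, whereas t is no longer defined, which is the sense in which analysis of variance extends the t-technique.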

S.E. and P.E. of a Correlation Coefficient.—The correlation between
two sets of measures is likely to be very unreliable if the number
of persons is small, as was shown in Ex. 28 (p. 76). The S.E. or
P.E. of a correlation coefficient tells us how widely this coefficient
may be expected to deviate from the true value, i.e. the value which
would be obtained from an unlimited number of persons. The
formulae for these errors will be given in Chap. VI.

S.E. and P.E. of an Individual Test Score or Examination Mark.—
If some pupils or students take a test or examination two or more
times, or answer two or more supposedly parallel tests in the same
subject, they will not obtain precisely identical scores on each
occasion. Hence any score or mark derived from only a single test is
to some extent unreliable. It may deviate more or less from the score
which would be obtained with far more thorough testing. In other
words, the single test only gives us a limited sample of the pupils' or
students' capabilities. Here too, then, the conception of the S.E. has
valuable applications. But the reliability of a score depends, not on
the number of persons who take the test, nor on the wideness of
dispersion of scores, but on the thoroughness of the test itself, or the
number of items of which it is composed, and on the amount of
variability in pupils' answers to these items. Elementary methods of
calculation are given in Chap. VII. A full treatment of the underlying
theory may be found in Gulliksen's (1950) book.

Finally, when a test or examination is employed for predicting a
pupil's success or lack of success during his subsequent educational
or vocational career, it is obvious that the estimate obtained from
the test score will deviate more or less from the truth. The reliability
of such predictions naturally depends mainly upon the adequacy or
validity of the test. It, too, can be expressed in terms of a S.E. or P.E.,
and the appropriate methods will be given below.

¹ It seems to be the convention in psychological literature to employ the S.E.
in the statistical treatment of averages and the like, and the P.E. in the treatment
of correlations and individual scores. There is no logical reason for the convention,
but we shall usually adhere to it in this book.

TESTING DIFFERENCES BETWEEN CATEGORIC
VARIABLES BY CHI-SQUARED

In educational and psychological research, there are many types
of data which we may wish to analyse or compare, but which are not
amenable to averaging or to the calculation of Standard Errors. For
example, the colour of people's eyes cannot readily be measured as
a continuous variable, yet we can classify individuals into a number
of more or less distinct groups or categories according to their
eye-colour. Similarly, we may wish to study the numbers of pupils at
different ages who go to the cinema 0, 1, 2 or more times per week,
or the numbers who answer Yes, Doubtful or No to an item in a
questionnaire. Or we may wish to compare how many boys and
girls are given gradings of VG, G, Fair and Poor in an examination.


Ex. 36.—A group of 80 male students stated anonymously that
their voting in a General Election was: 23 Conservative, 13 Liberal,
and 44 Labour. We shall take it that, in the country as a whole, the
proportions had been 45%, 10% and 45%. Are we justified in concluding
that the students were more left-wing than the average, or
might this difference arise just through chance errors of sampling?
Our approach is to assume that no real differences exist between our
group and a group of 80 typical voters (this is termed the 'null
hypothesis'), and then to analyse the degree of unlikelihood of the
difference occurring by chance.

In the following table, F gives the obtained frequencies of students
in each group or 'cell'. F_e represents the expected frequencies in an
equal number of typical voters (45% of 80 = 36, and so on).

                  F     F_e    F − F_e    (F − F_e)²    (F − F_e)²/F_e
Conservative     23     36      −13          169            4.694
Liberal          13      8      + 5           25            3.125
Labour           44     36      + 8           64            1.778
                 80     80        0                    χ² = 9.597

We then calculate Chi-Squared:

$$\chi^2 = \Sigma \frac{(F - F_e)^2}{F_e}$$

Note that the F − F_e column must add up to zero. The Chi-Squared Tables,¹
given in most statistical textbooks, tell us that this value, 9.597, has
a P or probability of close to 0.01, i.e. that it might arise by chance
only about once in a hundred similar samples. Hence there is a
fairly strong presumption that students of the kind we have studied
are less conservative than the general body of voters. Had our sample
been double the size, chi-squared would likewise be doubled, and
the probability would have dropped to less than 1 in 1,000.
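
The arithmetic of Ex. 36 can be reproduced with a short Python sketch. The variable names are illustrative only. Instead of consulting printed tables, the sketch uses the fact that for exactly 2 degrees of freedom the probability of exceeding a given chi-squared is exp(−χ²/2); this shortcut is an assumption of the sketch and does not hold for other numbers of degrees of freedom.

```python
from math import exp

observed = {"Conservative": 23, "Liberal": 13, "Labour": 44}
national_share = {"Conservative": 0.45, "Liberal": 0.10, "Labour": 0.45}

n = sum(observed.values())
# Expected frequencies in an equal number of typical voters.
expected = {party: share * n for party, share in national_share.items()}

chi_squared = sum(
    (observed[party] - expected[party]) ** 2 / expected[party] for party in observed
)
# Valid for 2 degrees of freedom only: P(chi-squared exceeds x) = exp(-x / 2).
p_value = exp(-chi_squared / 2)
# Doubling every frequency doubles chi-squared.
p_value_doubled_sample = exp(-(2 * chi_squared) / 2)

print(round(chi_squared, 3), round(p_value, 4), p_value_doubled_sample)
```

The first probability comes out a little under 0.01 and the second well under 1 in 1,000, in line with the statements in the example.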


The same procedure is used for comparing two or more groups,
except that we now ask whether our samples show the same distribution,
not whether they conform to some predetermined typical
distribution.



Ex. 37.—Groups of 75 boys and 65 girls gave the following as
their numbers of visits to the cinema per week. Is there a statistically
reliable sex difference?

                        F                     F_e          (F − F_e)²/F_e
No. of Visits    Boys  Girls  Total      Boys   Girls      Boys   Girls
0                  5    10     15        8.03    6.97      1.14   1.32
1                 33    32     65       34.82   30.18      0.10   0.11
2                 26    18     44       23.57   20.43      0.25   0.29
3                  …     …      …
4                  …     …      …
5                  1     —      1
3 to 5 combined   11     5     16        8.57    7.43      0.69   0.74
                  75    65    140       74.99   65.01      χ² = 4.64
                                                           P = .20


. 


If the null hypothesis were true, we would expect the same proportion
of boys as of girls in each line of the table; i.e. of the 15
children with 0 visits, $\frac{75}{140} \times 15 = 8.03$ should be boys, and
$\frac{65}{140} \times 15 = 6.97$ should be girls. Each entry in the F_e columns is similarly
calculated. Note that these F_e's must add up to the same totals as
the F's both across and down the way. The last three rows are best
combined together, since there is a general rule that no value of F_e
in a chi-squared analysis should fall below 5, or preferably 10.
$\Sigma \frac{(F - F_e)^2}{F_e}$ is calculated as before, but this time for both columns.

The obtained value of chi-squared is not significant statistically¹;
our differences between the sexes could occur by chance 1 in 5 times
in other similar samples.

¹ Such tables show the probability of chi-squared for different 'degrees of
freedom'. Degrees of freedom here means the number of cells whose F's must
be known in order to complete the distribution, in this case 2. For the total
number of students is fixed, and we need only to know the numbers voting for
two of the parties in order to determine the F in the third cell.
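
The calculation of expected frequencies from the row and column totals can be sketched as follows. The function name is hypothetical, and the frequencies are those of Ex. 37 with the last three rows already combined (11 boys and 5 girls making 3 or more visits, as follows from the column totals of 75 and 65). Because the sketch carries full precision throughout, its total comes to about 4.7, a little different in the second decimal from the printed 4.64, but the conclusion is unaffected.

```python
def chi_squared_two_groups(group_a, group_b):
    """Chi-squared for two frequency distributions classified into the same categories."""
    total_a, total_b = sum(group_a), sum(group_b)
    grand_total = total_a + total_b
    chi_squared = 0.0
    for f_a, f_b in zip(group_a, group_b):
        row_total = f_a + f_b
        # Expected frequencies keep each row in the proportion total_a : total_b.
        e_a = row_total * total_a / grand_total
        e_b = row_total * total_b / grand_total
        chi_squared += (f_a - e_a) ** 2 / e_a + (f_b - e_b) ** 2 / e_b
    degrees_of_freedom = len(group_a) - 1  # (rows - 1) x (2 columns - 1)
    return chi_squared, degrees_of_freedom

# Visits per week: 0, 1, 2, and 3 or more (last three rows combined).
boys = [5, 33, 26, 11]
girls = [10, 32, 18, 5]
chi_squared, df = chi_squared_two_groups(boys, girls)
print(round(chi_squared, 2), df)
```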
This particular problem might also have been studied by comparing
percentages, using the formula on p. 84. 49.3% of the boys, and
35.4% of the girls go twice or more a week. The C.R. works out at
1.68, which has a P value of 0.10. Though this gives a slightly more
favourable result, it too provides no acceptable evidence of a reliable
difference; and the more detailed chi-squared analysis is to be preferred.
Another point brought out by this example is that the F's in
each cell should invariably refer to numbers of separate and independent
individuals or items. A very common error among students is
to say that 26 boys have 52 visits between them, and 18 girls 36
visits, and so on, and then to apply chi-squared to the table of visits.

Another useful application of chi-squared is in testing whether an
obtained distribution conforms to the normal (or any other theoretical)
shape. We have seen that distributions of small numbers of
measures may depart considerably from normality, and that by increasing
N, the irregularities tend to become ironed out. Here also,
therefore, the result obtained from a limited sample (namely an
irregular distribution) can deviate considerably from the result which
a much larger sample would yield (a regular distribution) purely
through errors of sampling: and the unlikelihood of this deviation
can be calculated by chi-squared.

¹ Here the degrees of freedom are 3. The totals across and down the way are
fixed, thus we need to know only 3 cell entries to complete the table. In the
following example, there are 8 degrees. For though 11 pairs of categories are
compared, 3 quantities are fixed: the number of cases, the mean score, and the
standard deviation. For further explanation of this rather tricky concept of
degrees of freedom, see Fisher (1930).

Ex. 38.—The following figures are extracted from Ex. 20 (p. 50).
They show in the F column the actual frequencies of scores obtained
by students on the Cattell test, and in the F_e column the frequencies
which would be expected if the distribution of scores was a truly
normal one. The procedure is as before, the smallest frequencies, at
the head and foot of the columns, being grouped together.

   F      F_e     F − F_e     (F − F_e)²/F_e
   8       6        +2           0.6667       (head of the column, grouped)
  11       9        +2           0.4444
  22      19        +3           0.4737
  26      30        −4           0.5333
  35      41        −6           0.8780
  47      44        +3           0.2045
  43      40        +3           0.2250
  27      28        −1           0.0357
  18      18         0           0.0000
  10       9        +1           0.1111
   3       6        −3           1.5000       (foot of the column, grouped)
 250     250

Chi-squared has a probability of 0.75. We might expect as great a
deviation, or greater, in three instances out of four by pure chance.
Thus the approximation to normality is quite satisfactory.


MEASUREMENT OF CORRELATION

A very wide range of problems in educational and other branches
of psychology depends on the study of relationships between
variables. For example, examinations are set at the end of the primary
school career, partly in order to test the work that the children
have done, but also partly because achievement in English and arithmetic
is supposed to be related to, or predictive of, future achievement
in more advanced work. The main object of applying intelligence,
or other aptitude, tests is to show probable ability in various
fields which are related to the test performances. In many secondary
schools the pupils are allotted to different classes for general work,
for mathematics, and for foreign languages, since it is believed (with
much justification) that abilities in these three types of study are not
very closely related, that some may be good in general work, poor at
mathematics, medium at languages, and so on.

Several different methods, applicable to different kinds of data,
are available for determining the extent of relationship between two
variables. Almost all of them yield indices ranging from 0 (no
correlation) to +1.0 (perfect correlation). By far the most frequently
used are the product-moment correlation coefficient and the rank-order
correlation coefficient. We will point out later the limitations
of these two methods, and mention briefly some of the alternatives.


D 


Ex. 39.—The following marks were obtained by a class of 36
children on examinations in history and geography. A glance will
show that there is some correspondence; that those who obtain
high marks in one tend to get high marks in the other, and that low
marks in one tend to go with low marks in the other; but that there
are many exceptions to this trend. The state of affairs may be seen
more clearly if the two sets of measures are plotted against one
another. Fig. 21 is called a scatter-diagram or scattergram. The
marks have both been grouped into classes of 2. History marks ($X_1$)
are shown on the vertical axis, geography marks ($X_2$) on the horizontal
axis. Each pupil's position in the two exams is shown by a
tick. For example, Alan's 16 and 12 are represented by a tick
opposite the 15-16 mark in history, and below the 11-12 mark in
geography. Four pupils obtained either 17 or 18 in history and
either 13 or 14 in geography, hence there are four ticks in this box
or cell.

Name      History   Geography  |  History   Geography  |  History   Geography
           ($X_1$)    ($X_2$)  |  ($X_1$)    ($X_2$)   |  ($X_1$)    ($X_2$)
Alan         16        12      |    10         5       |     8         6
Helen        19        15      |    14        18       |    20        12
…            17        11      |     8         6       |    17        11
…            10        17      |     6         4       |    12         8
…            18        14      |    18        20       |     7         2
…            18        16      |    14         8       |    17        16
…            15         4      |    12         4       |    14        11
…            17        13      |    18        14       |    14         8
…            12        12      |    16        10       |    12         3
…            16         8      |     6         4       |    16         1
…            11        10      |    18        14       |    20        15
…            14         9      |    12        11       |    17         6

[Fig. 21: scattergram with history marks ($X_1$) in classes of 2 on the vertical
axis and geography marks ($X_2$) in classes of 2 on the horizontal axis, each
pupil shown by a tick, with a dotted diagonal line.]
Fig. 21.—Scattergram, comparing two sets of school marks.
. e 


It will be seen that the majority of ticks tend to be grouped along
the dotted diagonal line, although there are some rather extreme
exceptions; for instance, the pupil with marks of 16 and 1, and the
pupil with marks of 10 and 17. Obviously the closer the correspondence
between the two sets, the more closely would the ticks fall along
the diagonal, and the fewer would be the exceptions whose ticks are
far removed from it. Fig. 26 (p. 133) and Ex. 49 (p. 113) show the
scattergrams for a low correlation of +0.146 and a high correlation
of +0.799 (the numbers in each cell represent the numbers of ticks).



Only if each pupil got marks which exactly corresponded would the
correlation be perfect and the coefficient be +1.0. Note, however, that
it is not necessary for the marks on the two variables to be identical
for a perfect correlation to be obtained, since these marks may be
arranged on different scales. In Fig. 21 the geography marks are on
a different scale from the history marks; 10 on the one corresponds,
or is relatively equivalent, to 14 on the other. Consider another instance:
the weights and heights of a class of children would certainly
yield a very high correlation, although they would be measured on
entirely different scales, pounds and inches. It is the closeness of
agreement of relative scores or measures which governs the size of
the correlation.

Very few correlations in educational or psychological investigations
approximate to +1.0. Usually there is only a general trend for
high scores on one variable to go with high scores on the other, a
trend which admits of many exceptions. Two tests of the same ability,
e.g. two similar arithmetic papers, may perhaps give a coefficient of
+0.90 or over. Tests or exams in different subjects, e.g. arithmetic
and English, or history and geography, are more likely to give
moderate correlation coefficients of +0.40 to +0.70.

Occasionally one may find negative or inverse correspondence,
when high scores on one variable go with low scores on another
variable. For instance, in an ordinary school class intelligence or
scholastic achievement and age are often negatively related, since
the oldest pupils are rather dull, the youngest pupils very bright.
Negative correlations are expressed, just as positive ones, by indices
ranging from 0 to −1.0.
Calculation of Correlations by the Product-moment Method (sometimes
called the Pearson-Bravais Method).—The fundamental
formula for the correlation coefficient, r, is:

$$r = \frac{\Sigma x_1 x_2}{N \sigma_1 \sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the two distributions. $x_1$ represents
a score or measure expressed as a deviation from the mean of all the
$X_1$ measures, $x_2$ is a measure expressed as a deviation from the mean
of all the $X_2$ measures. $x_1 x_2$ is the product of a person's two measures,
and $\Sigma x_1 x_2$ is the sum of all these products.

Since the means are seldom likely to be whole numbers, the computation
of $\Sigma x_1 x_2$ would involve very troublesome arithmetic.
Hence, just as in calculating S.D.s, it is usual to resort to the device
of the arbitrary mean. Each of the quantities in the above formula
then requires a correction. If $x_1$ is a deviation from an arbitrary
mean, $\sigma_1$ is no longer $\sqrt{\dfrac{\Sigma x_1^2}{N}}$, but

$$\sigma_1 = \sqrt{\frac{\Sigma x_1^2}{N} - \left(\frac{\Sigma x_1}{N}\right)^2}$$

(cf. p. 47). A corresponding alteration is made in $\sigma_2$. $\dfrac{\Sigma x_1 x_2}{N}$ becomes
$\dfrac{\Sigma x_1 x_2}{N} - \dfrac{\Sigma x_1}{N} \times \dfrac{\Sigma x_2}{N}$. Hence the full formula is:

$$r = \frac{\dfrac{\Sigma x_1 x_2}{N} - \dfrac{\Sigma x_1}{N} \times \dfrac{\Sigma x_2}{N}}{\sqrt{\dfrac{\Sigma x_1^2}{N} - \left(\dfrac{\Sigma x_1}{N}\right)^2} \sqrt{\dfrac{\Sigma x_2^2}{N} - \left(\dfrac{\Sigma x_2}{N}\right)^2}}$$

This same formula applies if the original scores are used, i.e. $X_1$
and $X_2$ are substituted for $x_1$ and $x_2$. The only difference is that the
arbitrary mean is now zero, instead of being some whole number
chosen close to the true mean.

Statistical textbooks often provide modifications of this formula,
or 'patent' correlation techniques which are claimed to be simpler.
The present writer prefers the direct method given here, both because
it makes use of concepts such as the true and arbitrary mean, $\sigma$, and
the correction required when $\sigma$ is calculated from an arbitrary mean
(concepts with which the student should already be familiar) and
because errors in computation are fairly readily detected. Moreover,
a coefficient accurate to three places of decimals can be obtained by
this method if the multiplication and division, and the finding of
squares and square roots, are done with a 10-inch slide-rule and/or
four-figure logarithm tables.


Ex. 40.—The formula will be applied first to the original table of
marks, given in Ex. 39, and later to the same marks grouped into
classes (as in Fig. 21). This first method should generally be employed
only when N is small, say less than 50, and when a high degree of
accuracy is needed.¹ In the following table, columns $X_1$ and $X_2$ list
the original marks. As arbitrary means, 14 and 10 are selected.

¹ When an electric calculating machine is available, this method is generally
used with original scores (not an arbitrary mean), even if N is very large.


 X₁    X₂     x₁     x₂     x₁²    x₂²      x₁x₂
                                           +      −
 16    12     +2     +2      4      4      4
 19    15     +5     +5     25     25     25
 17    11     +3     +1      9      1      3
 10    17     −4     +7     16     49            28
 18    14     +4     +4     16     16     16
 18    16     +4     +6     16     36     24
 15     4     +1     −6      1     36             6
 17    13     +3     +3      9      9      9
 12    12     −2     +2      4      4             4
 16     8     +2     −2      4      4             4
 11    10     −3      0      9      0      0
 14     9      0     −1      0      1      0
 10     5     −4     −5     16     25     20
 14    18      0     +8      0     64      0
  8     6     −6     −4     36     16     24
  6     4     −8     −6     64     36     48
 18    20     +4    +10     16    100     40
 14     8      0     −2      0      4      0
 12     4     −2     −6      4     36     12
 18    14     +4     +4     16     16     16
 16    10     +2      0      4      0      0
  6     4     −8     −6     64     36     48
 18    14     +4     +4     16     16     16
 12    11     −2     +1      4      1             2
  8     6     −6     −4     36     16     24
 20    12     +6     +2     36      4     12
 17    11     +3     +1      9      1      3
 12     8     −2     −2      4      4      4
  7     2     −7     −8     49     64     56
 17    16     +3     +6      9     36     18
 14    11      0     +1      0      1      0
 14     8      0     −2      0      4      0
 12     3     −2     −7      4     49     14
 16     1     +2     −9      4     81            18
 20    15     +6     +5     36     25     30
 17     6     +3     −4      9     16            12

N = 36       +61    +72    549    836   +466    −74
             −56    −74                  −74
            = +5   = −2                = +392

Columns $x_1$ and $x_2$ list the deviations of $X_1$ and $X_2$ from these A's.
They are summed to give $\Sigma x_1$ and $\Sigma x_2$.

$$\frac{\Sigma x_1}{N} = \frac{+5}{36} = +0.1389 \qquad \frac{\Sigma x_2}{N} = \frac{-2}{36} = -0.0555$$

Therefore the true means are 14.139 and 9.945. Later we shall need
$\left(\dfrac{\Sigma x_1}{N}\right)^2$, $\left(\dfrac{\Sigma x_2}{N}\right)^2$, and $\dfrac{\Sigma x_1}{N} \times \dfrac{\Sigma x_2}{N}$.
These come to 0.019, +0.003, and −0.008 respectively. Note
that three places of decimals are sufficient for these corrections.
Special attention should be paid to the sign (+ or −) of $\dfrac{\Sigma x_1}{N} \times \dfrac{\Sigma x_2}{N}$.

The two columns on p. 96, $x_1^2$ and $x_2^2$, give the deviations squared.
These are summed and averaged.

$$\frac{\Sigma x_1^2}{N} = \frac{549}{36} = 15.250 \qquad \frac{\Sigma x_2^2}{N} = \frac{836}{36} = 23.222$$

Hence $\sigma_1 = \sqrt{\dfrac{\Sigma x_1^2}{N} - \left(\dfrac{\Sigma x_1}{N}\right)^2} = \sqrt{15.250 - 0.019} = 3.903$,

and $\sigma_2 = \sqrt{23.222 - 0.003} = 4.819$.

The last pair of columns gives the products of the deviations; the
+ and − products are separated merely for convenience in summing.
Special attention to the signs is needed. When $x_1$ and $x_2$ are both +
or both − the product is plus, when one is + the other − the product
is minus. Notice how the exceptional cases, such as the pupil
with 10 and 17, make large negative contributions to $\Sigma x_1 x_2$, and so
ultimately reduce the size of r. The pupils who get very high marks
on both variables, or very low marks on both, such as those with 6
and 4 or with 18 and 20, make large positive contributions to $\Sigma x_1 x_2$
and help to raise r.

$$\frac{\Sigma x_1 x_2}{N} = \frac{+392}{36} = +10.889$$

The correction, $\dfrac{\Sigma x_1}{N} \times \dfrac{\Sigma x_2}{N}$, is subtracted from this, but as its sign
was −, it is in this case added.

$$\frac{\Sigma x_1 x_2}{N} - \frac{\Sigma x_1}{N} \times \frac{\Sigma x_2}{N} = 10.889 - (-0.008) = 10.897$$

All the corrections indicated in the second formula have been
made, hence we can finally apply the first, or fundamental, correlation
formula:

$$r = \frac{10.897}{4.819 \times 3.903} = +0.579$$

Thus the coefficient shows moderate correspondence between the
two sets of marks, as we guessed from looking at the scattergram.
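
The whole of this working can be checked with a short Python sketch. The 36 pairs of marks below are transcribed from the table of this example; where the print is unclear they have been chosen so as to agree with the column totals (Σx₁ = +5, Σx₂ = −2, Σx₁² = 549, Σx₂² = 836, Σx₁x₂ = +392), so they should be treated as a careful reading, not as certain. The function name is hypothetical.

```python
from math import sqrt

history = [16, 19, 17, 10, 18, 18, 15, 17, 12, 16, 11, 14,
           10, 14, 8, 6, 18, 14, 12, 18, 16, 6, 18, 12,
           8, 20, 17, 12, 7, 17, 14, 14, 12, 16, 20, 17]
geography = [12, 15, 11, 17, 14, 16, 4, 13, 12, 8, 10, 9,
             5, 18, 6, 4, 20, 8, 4, 14, 10, 4, 14, 11,
             6, 12, 11, 8, 2, 16, 11, 8, 3, 1, 15, 6]

def r_from_arbitrary_means(xs, ys, a_x, a_y):
    """Product-moment r computed from deviations about arbitrary means a_x and a_y."""
    n = len(xs)
    dx = [x - a_x for x in xs]
    dy = [y - a_y for y in ys]
    mean_dx, mean_dy = sum(dx) / n, sum(dy) / n
    # Each quantity is corrected for the use of an arbitrary mean.
    var_x = sum(d * d for d in dx) / n - mean_dx ** 2
    var_y = sum(d * d for d in dy) / n - mean_dy ** 2
    covariance = sum(a * b for a, b in zip(dx, dy)) / n - mean_dx * mean_dy
    return covariance / sqrt(var_x * var_y)

r = r_from_arbitrary_means(history, geography, 14, 10)
r_from_raw_scores = r_from_arbitrary_means(history, geography, 0, 0)
print(round(r, 3), round(r_from_raw_scores, 3))
```

Both calls give the same coefficient, which illustrates the remark that using the original scores simply amounts to taking zero as the arbitrary mean.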


Product-moment Correlation between Tabulated Measures.—When
the measures are grouped into classes, the procedure is essentially
identical, except that the computations are, of course, performed for
all the measures in any one class at one time. We are no longer concerned
with the deviations of single measures from their mean, and
their products, but with deviations of classes from the mean class,
and their products. At no time is it necessary to translate the results
back into terms of the original measures, by multiplying by c (the
size of the class).

Ex. 41.—The two sets of measures are first plotted in a scattergram,
which should contain, if possible, between 11 and 19 rows and
a similar number of columns. There is no need to have identical
numbers of rows and columns. In Fig. 21 (p. 93) the data are, as a
matter of fact, rather too coarsely grouped, since there are only 8
rows and 10 columns. Over-coarse grouping has the effect of reducing
slightly the size of r (a formula for correcting this may be found in
advanced statistical textbooks). By adding up the frequencies in each
row, and in each column, we get the two frequency distributions of
grouped measures. Fig. 21 is reproduced below with the frequency
distribution of $x_1$ measures ($F_1$) written at the right-hand side of the
rows, and the frequency distribution of $x_2$ measures ($F_2$) written at
the foot of the columns. The frequencies in both these distributions
should be checked by making sure that they add up to N.

                     Geography class, x₂ =
History     x₁    −4   −3   −2   −1    0   +1   +2   +3   +4   +5  |  F₁
19-20       +3     .    .    .    .    .    1    .    2    .    .  |   3
17-18       +2     .    .    1    .    .    2    4    2    .    1  |  10
15-16       +1     1    1    .    1    1    1    .    .    .    .  |   5
13-14        0     .    .    .    2    1    1    .    .    1    .  |   5
11-12       −1     .    2    .    1    1    2    .    .    .    .  |   6
 9-10       −2     .    .    1    .    .    .    .    .    1    .  |   2
 7-8        −3     1    .    2    .    .    .    .    .    .    .  |   3
 5-6        −4     .    2    .    .    .    .    .    .    .    .  |   2
F₂                 2    5    4    4    3    7    4    4    2    1  |  36 = N

An appropriate arbitrary mean is now chosen for the $x_1$ classes
and for the $x_2$ classes, and the classes above and below these means
are numbered +1, +2, ... −1, −2, ... etc. These numberings
are written at the left-hand side of the rows, and above the tops of
the columns. Standard deviations are now calculated in the usual
way.

x₁ and x₂     F₁     F₁x₁    F₁x₁²      F₂     F₂x₂    F₂x₂²
   +5          —       —       —         1     + 5      25
   +4          —       —       —         2     + 8      32
   +3          3     + 9      27         4     +12      36
   +2         10     +20      40         4     + 8      16
   +1          5     + 5       5         7     + 7       7
    0          5       0       0         3       0       0
   −1          6     − 6       6         4     − 4       4
   −2          2     − 4       8         4     − 8      16
   −3          3     − 9      27         5     −15      45
   −4          2     − 8      32         2     − 8      32
              36     +34     145        36     +40     213
                     −27                       −35
                    = +7                      = +5

$$\frac{\Sigma F_1 x_1}{N} = \frac{+7}{36} = +0.1944 \qquad \left(\frac{\Sigma F_1 x_1}{N}\right)^2 = 0.0378$$

$$\frac{\Sigma F_2 x_2}{N} = \frac{+5}{36} = +0.1389 \qquad \left(\frac{\Sigma F_2 x_2}{N}\right)^2 = 0.0193$$
The correction $\dfrac{\Sigma F_1 x_1}{N} \times \dfrac{\Sigma F_2 x_2}{N}$, which will be needed later in the
numerator of the correlation formula, is:

$$+0.1944 \times +0.1389 = +0.027$$

$$\frac{\Sigma F_1 x_1^2}{N} = \frac{145}{36} = 4.028 \qquad \sigma_1 = \sqrt{4.028 - 0.038} = 1.997$$

$$\frac{\Sigma F_2 x_2^2}{N} = \frac{213}{36} = 5.917 \qquad \sigma_2 = \sqrt{5.917 - 0.019} = 2.429$$

Note that these S.D.s are approximately one-half those obtained
in the previous computation of r (namely 3.903 and 4.819), the
reason being that we have grouped the original measures into classes
of 2. They are not exactly half because the grouping has produced
some slight distortion of the distribution.

Now to obtain the products, the frequency in each cell must be
multiplied by its $x_1$ value and by its $x_2$ value. For instance, the entry
2 in cell ($x_1$ = +3) ($x_2$ = +3) contributes 2 × 3 × 3 = +18 to
$\Sigma F x_1 x_2$. The entry 1 in cell ($x_1$ = −2) ($x_2$ = +4) contributes
1 × 4 × −2 = −8.

x₁x₂    Cells corresponding to            Frequencies       Total F    Fx₁x₂
        this value of x₁x₂                in these cells
  0     x₁ = 0 or x₂ = 0 (six cells)      1, 2, 1, 1, 1, 1      7          0
 +1     x₁ = +1, x₂ = +1                  1
        x₁ = −1, x₂ = −1                  1                     2        + 2
 +2     x₁ = +2, x₂ = +1                  2                     2        + 4
 +3     x₁ = +3, x₂ = +1                  1
        x₁ = −1, x₂ = −3                  2                     3        + 9
 +4     x₁ = +2, x₂ = +2                  4
        x₁ = −2, x₂ = −2                  1                     5        +20
 +6     x₁ = +2, x₂ = +3                  2
        x₁ = −3, x₂ = −2                  2                     4        +24
 +9     x₁ = +3, x₂ = +3                  2                     2        +18
+10     x₁ = +2, x₂ = +5                  1                     1        +10
+12     x₁ = −3, x₂ = −4                  1
        x₁ = −4, x₂ = −3                  2                     3        +36
                                                                        +123
 −1     x₁ = +1, x₂ = −1                  1
        x₁ = −1, x₂ = +1                  2                     3        − 3
 −3     x₁ = +1, x₂ = −3                  1                     1        − 3
 −4     x₁ = +1, x₂ = −4                  1
        x₁ = +2, x₂ = −2                  1                     2        − 8
 −8     x₁ = −2, x₂ = +4                  1                     1        − 8
                                                                        − 22
                                                               36   ΣFx₁x₂ = +101


It is unnecessary, however, to consider each cell in turn. Instead
we take each possible value of $x_1 x_2$ in turn, as shown in the accompanying
table. All the cells for which either $x_1$ = 0 or $x_2$ = 0 contribute
zero to $\Sigma F x_1 x_2$. It will be seen that the total frequency in such
cells is 7. The total frequency in the cells where $x_1 x_2$ = +1 is 2,
hence +2 is entered in the last column.

As soon as the student becomes familiar with the method he will
find it unnecessary to write out the second, third, and fourth
columns of such a table. Only the first, and the last two are needed.
The simplest plan is to draw a cross through each of the cells for
which $x_1 x_2$ = 0, and to count up the frequencies in these cells while
doing so, and then enter the total in a table thus:

x₁x₂     F     Fx₁x₂
  0      7       0

Next cross out the cells where $x_1 x_2$ = +1, and enter in the table,
and so on. At the end, sum the F column, in order to make sure that
none of the entries in the cells have been omitted. The last column
is summed to give $\Sigma F x_1 x_2$, and the rest of the working is as before.

$$\frac{\Sigma F x_1 x_2}{N} = \frac{101}{36} = 2.806$$

The correction, to be subtracted, is +0.027.

$$2.806 - 0.027 = 2.779 \qquad r = \frac{2.779}{1.997 \times 2.429} = +0.573$$

It will be seen that in spite of the distortion due to grouping the
scores into a rather coarse scattergram, the result is nearly identical
with that obtained from the ungrouped measures, namely +0.579.
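
The grouped calculation can be sketched in Python as well. The marks are the 36 pairs used in Ex. 40 (read from the printed table, with doubtful figures chosen to agree with the printed totals); the helper `class_deviation` is a hypothetical name, and it assumes classes of 2 beginning 1-2, 3-4, and so on, with 13-14 and 9-10 as the arbitrary mean classes.

```python
from collections import Counter
from math import sqrt

history = [16, 19, 17, 10, 18, 18, 15, 17, 12, 16, 11, 14,
           10, 14, 8, 6, 18, 14, 12, 18, 16, 6, 18, 12,
           8, 20, 17, 12, 7, 17, 14, 14, 12, 16, 20, 17]
geography = [12, 15, 11, 17, 14, 16, 4, 13, 12, 8, 10, 9,
             5, 18, 6, 4, 20, 8, 4, 14, 10, 4, 14, 11,
             6, 12, 11, 8, 2, 16, 11, 8, 3, 1, 15, 6]

def class_deviation(mark, top_of_arbitrary_class):
    """Number of classes of 2 by which a mark lies above or below the arbitrary class."""
    return (mark + 1) // 2 - top_of_arbitrary_class // 2

# Frequency F in each cell of the scattergram, keyed by (x1, x2).
cells = Counter(
    (class_deviation(h, 14), class_deviation(g, 10))
    for h, g in zip(history, geography)
)
n = sum(cells.values())
sum_x1 = sum(f * x1 for (x1, _), f in cells.items())
sum_x2 = sum(f * x2 for (_, x2), f in cells.items())
sum_x1_sq = sum(f * x1 * x1 for (x1, _), f in cells.items())
sum_x2_sq = sum(f * x2 * x2 for (_, x2), f in cells.items())
sum_products = sum(f * x1 * x2 for (x1, x2), f in cells.items())

sigma_1 = sqrt(sum_x1_sq / n - (sum_x1 / n) ** 2)
sigma_2 = sqrt(sum_x2_sq / n - (sum_x2 / n) ** 2)
r = (sum_products / n - (sum_x1 / n) * (sum_x2 / n)) / (sigma_1 * sigma_2)
print(sum_products, round(sigma_1, 3), round(sigma_2, 3), round(r, 3))
```

The class size c never enters the calculation, in keeping with the remark that the results need not be translated back into the original measures.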


There are, of course, numerous possibilities of error in working out
product-moment correlations. All additions should be checked up
and down; multiplying, dividing, squaring, etc., may be done with
logarithm tables and confirmed with a slide-rule. But a little practice
will enable the student to predict r fairly closely from consideration
of the scattergram, and so to judge whether or not the value he
reaches is likely to be correct.
An Alternative Method of Calculating Correlation from the Scattergram.—
We will describe one of the various alternative methods,
which may be found simpler to use than the above, direct, method.
It eliminates the collection of the contributions to $\Sigma F x_1 x_2$ from each
cell, and substitutes the calculation of two additional S.D.s, which
we shall call $\sigma_3$ and $\sigma_4$. Actually, one of these is sufficient, but the
other acts as a valuable check. The formula for r is:

$$r = \frac{\sigma_3^2 - \sigma_1^2 - \sigma_2^2}{2\sigma_1\sigma_2} = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_4^2}{2\sigma_1\sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the $F_1$ and $F_2$ distributions as before.

t 


Хо, о 
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The other two distributions are obtained by counting the frequencies У 
‚ along the diagonals. | -~ $ 3 


Ex. 42.—Applying this to the scattergram in Ex. 41: for σ3 we
start from the top right-hand corner and proceed to the bottom
left-hand, counting all the entries along the diagonal at right angles
to this line as we go. On the diagonal running from (x1 = +4,
x2 = +3) to (x1 = +2, x2 = +5) there is one entry. The next
diagonal runs from (x1 = +3, x2 = +3) to (x1 = +1,
x2 = +5); this takes in 2 entries. The next also takes in 2, the
next 1 + 4 + 1 = 6, and so on. The complete distribution, F3, is
given in the following table. The corresponding distribution F4,
along the diagonal from top left to bottom right, is also listed.
Arbitrary means are chosen and the S.D.s worked out in the usual
way.


[Table of the F3 and F4 distributions, with columns d, Fd and Fd² for each; the individual entries are illegible.]

ΣF3d = 65 - 53 = 12;   ΣF3d² = 560;   ΣF4d² = 156.

Mean d = 12/36 = 0.333;   ΣF3d²/N = 560/36 = 15.556

\[ \sigma_3^2 = 15.556 - (0.333)^2 = 15.445 \]

Similarly σ4² = 4.333 - (0.056)² = 4.330.
We already know σ1² and σ2² to be 3.990 and 5.898. Hence:

\[ r = \frac{15.445 - 3.990 - 5.898}{2\sqrt{3.990 \times 5.898}} = +0.573 \]

The version using σ4 instead of σ3 gives the same result, which is
also identical with that obtained on p. 99.
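
The arithmetic of Ex. 42 can be checked with a short sketch. It rests on one reading of the method, namely that σ3 is the S.D. of the sums x1 + x2 and σ4 that of the differences x1 - x2; the function name and its `use_sum` switch are illustrative only and do not come from the text.

```python
import math

def r_from_diagonal_variance(var_diag, var_1, var_2, use_sum=True):
    """Correlation from the variance of one diagonal distribution.

    use_sum=True  : var_diag is sigma_3 squared (variance of x1 + x2)
    use_sum=False : var_diag is sigma_4 squared (variance of x1 - x2)
    """
    denom = 2.0 * math.sqrt(var_1 * var_2)   # 2 * sigma_1 * sigma_2
    if use_sum:
        return (var_diag - var_1 - var_2) / denom
    return (var_1 + var_2 - var_diag) / denom

# Variances quoted in Ex. 42
r_sum = r_from_diagonal_variance(15.445, 3.990, 5.898, use_sum=True)
r_diff = r_from_diagonal_variance(4.330, 3.990, 5.898, use_sum=False)
print(round(r_sum, 3), round(r_diff, 3))
```

Both versions should agree to three decimals, which is the check the second diagonal is meant to provide.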


Rank-order Correlation.—A different method for determining
correlations is available for data arranged in order, for example the
orders of merit of a school class on two examinations. When the
number of cases is greater than about 30, the figures involved become
troublesomely big, but with a small number the method is much
easier to use than the product-moment method. Often, therefore, sets
of scores are turned into rank orders and the correlation between
these orders is computed. The formula is as follows:

\[ r = 1 - \frac{6\Sigma d^2}{N(N^2 - 1)} \]

where d is the difference between each person's two ranks. Such a
correlation is often indicated by the Greek letter ρ (rho), instead of
by r.


е 


Ex. 43.—The second and third columns of the following table
give the marks of eleven pupils on tests X1 and X2. The next two
columns show the ranks, pupil A being 6th on X1, 4th on X2, and
so on. The d column gives the differences in rank, regardless of sign.
These are squared and summed.

Pupil    X1            X2            Rank X1   Rank X2    d       d²
A        [illegible]   75             6         4         2       4
B        85            66             1         7         6      36
C        64            58             7         9         2       4
D        78            82             3         1         2       4
E        60            58             8         9         1       1
F        70            80             5         2½        2½      6.25
G        83            80             2         2½        ½       0.25
H        55            71            10         5         5      25
I        [illegible]   58            11         9         2       4
J        72            [illegible]    4        11         7      49
K        59            68             9         6         3       9
                                                          Σd² = 142.5

Substituting in the formula:

\[ r = 1 - \frac{6 \times 142.5}{11(121 - 1)} = 1 - 0.648 = +0.352 \]
Clearly the bigger the differences in rank positions, the bigger will
Σd² be, and the lower the value of r. If 6Σd² is greater than N(N² - 1)
the correlation will be negative.

Now there is no difficulty in turning scores into ranks when, as in
X1 in the above example, every person gets a different score. But
when two (or more) persons tie with the same score, they then divide
the two (or more) ranks between them, as in X2. Thus pupils F and
G must share 2nd and 3rd place, and are both called 2½, whilst the
next highest pupil, A, is called 4. Again, C, E, and I share the 8th,
9th, and 10th positions, and are all called 9, whilst J, the next highest,
is 11. Such sharing is undesirable in that it distorts, and generally
decreases, the size of the correlation. It follows, therefore, that the
rank-order method should not be applied when there are a great
many ties.
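
The rank-order working, including the sharing of tied ranks, may be sketched as follows. The helper names are illustrative; the two rank lists are those of pupils A to K in Ex. 43, and the small list of four marks is only there to show how ties are shared.

```python
def average_ranks(scores):
    """Rank from the highest score (rank 1) downward.
    Tied scores share the mean of the ranks they occupy."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the run while the next score equals the current one
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2.0   # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def rank_order_r(ranks_1, ranks_2):
    """r = 1 - 6 * sum(d^2) / (N (N^2 - 1))."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))

# Ranks of pupils A..K on X1 and X2, as listed in Ex. 43
ranks_x1 = [6, 1, 7, 3, 8, 5, 2, 10, 11, 4, 9]
ranks_x2 = [4, 7, 9, 1, 9, 2.5, 2.5, 5, 9, 11, 6]
rho = rank_order_r(ranks_x1, ranks_x2)

# Two pupils tied on 80 share 2nd and 3rd place
tied = average_ranks([82, 80, 80, 75])
print(round(rho, 3), tied)
```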
OTHER METHODS OF MEASURING CORRELATION

The nature and the main uses of other methods will be indicated
briefly, but they will not be described in detail. Full accounts of them
may be found in more advanced textbooks.

Spearman's Foot-rule Method.—This, like the rank method, is
applicable to data arranged in rank order. It is perhaps even simpler
than the ordinary rank method, but the coefficients which it yields
range only from +1.0 to -0.5, not from +1.0 to -1.0, and they
often diverge rather widely from those obtained by the first two
methods. A new and easy method, which possesses certain important
statistical advantages over the rank method, has been described by
Kendall (1948).

Correlation Ratio.—This is denoted by the symbol η (eta), instead
of by r. We saw in Fig. 21 (p. 93), that the ticks in a scattergram tend
to be grouped along a straight diagonal. This is referred to as "linear
regression." When the regression is non-linear, that is, when the ticks
tend to be grouped round a curve, a product-moment correlation
fails to show the closeness of relationship, and the correlation ratio
is used instead. For example, when tests of motor perseveration are
compared with assessments of desirable character traits, it is sometimes
found that both the persons with high perseveration scores
and those with low scores are weak in character, whereas those with
medium scores get the best character assessments. This type of
relationship is illustrated in the accompanying scattergram, Fig. 22.

[Fig. 22.—Scattergram of non-linear correlation. Axes: character ratings and motor perseveration scores.]

Coefficient of Contingency.—Ordinary correlation methods cannot
be applied to categoric variables such as those described on p. 88,
nor to grossly non-normal distributions. We can nevertheless analyse
the association between, say, hair and eye colour, or between religious
affiliation and socio-economic grade, etc., by an extension of the
chi-squared technique. The individuals are first classified into convenient
numbers of categories for each variable, and a cross-classification
table, similar to a scattergram, is drawn up. We then assume that
there is no relationship between the variables and compute, on the
basis of this hypothesis, how many individuals would be expected to
fall within each of the cells. If the actual frequencies diverge from
these expected ones to an extent which cannot possibly be ascribed
to chance errors of sampling, this provides evidence of a relationship
between the variables.
Ex. 44.—Records of the times spent on television viewing over
one week were kept by 293 grammar school pupils, who were also
classified according to the number of years their families had had
television sets. Does the following table indicate that pupils from
families which have owned sets for a long time view less than
relatively recent owners?

                             Length of Ownership
                      Less than     1-2       3 years
                      1 year        years     and over     Total
Hours of      8+         14          31          41          86
viewing       4-7+       14          49          66         129
in 1 week     0-3+        5          21          52          78
              Total      33         101         159         293
ел 
Note that there is no need to have the same number of categories
for both variables. Quite broad ones are generally adopted, so that
27 no value of Е, falls much below 02 

Now if there were no relation between the two variables, we
might expect the 86 pupils with 8+ hours viewing to be distributed,
as regards their ownership periods, in the proportions 33 : 101 : 159.
That is, the Fe's for the first line of the table should be:
86 × 33/293, 86 × 101/293, and 86 × 159/293 = 9.7, 29.6, and 46.7.
The other entries in the Fe table are similarly calculated, and the
F - Fe table does suggest that a certain excess of short-owners view
for more hours than long-owners.
TABLE OF Fe
   9.7     29.6     46.7   |   86
  14.5     44.5     70.0   |  129
   8.8     26.9     42.3   |   78
  33      101      159     |  293

TABLE OF (F - Fe)
  +4.3     +1.4     -5.7
  -0.5     +4.5     -4.0
  -3.8     -5.9     +9.7

TABLE OF (F - Fe)²/Fe
   1.91     0.07     0.70
   [second row illegible]
   1.64     [remaining entries illegible]

Chi-squared = 8.55, and as there are 4 degrees of freedom (cf. footnote,
p. 89), this has a probability of 0.07. Such odds are not high, but they are
suggestive of a slight negative correlation between amount of viewing and
length of ownership.
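
The working of Ex. 44 may be sketched in a few lines, each expected frequency being taken as row total × column total ÷ N. The function name is illustrative. Carried through without rounding the intermediate entries, the sum comes to about 8.51 rather than the 8.55 quoted, a difference presumably due to rounding in the hand calculation.

```python
def chi_squared(table):
    """Expected frequencies, chi-squared and degrees of freedom
    for a cross-classification table of observed frequencies F."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Fe for each cell on the hypothesis of no relationship
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    chi2 = sum((f - fe) ** 2 / fe
               for row, exp_row in zip(table, expected)
               for f, fe in zip(row, exp_row))
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, dof

# Rows: 8+, 4-7+, 0-3+ hours; columns: under 1 year, 1-2 years, 3 years and over
viewing = [[14, 31, 41],
           [14, 49, 66],
           [5, 21, 52]]
expected, chi2, dof = chi_squared(viewing)
print([round(fe, 1) for fe in expected[0]], round(chi2, 2), dof)
```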

This chi-squared can be converted into what is called the contingency
coefficient, C. A further modification yields a coefficient
which is more closely comparable to a correlation, rc.

\[ C = \sqrt{\frac{\chi^2}{N + \chi^2}} \]

[The formula for rc, which involves k1, k2, N and d.f., is illegible.]

Here, d.f. refers to degrees of freedom, and k provides a correction
for small numbers of categories.


  No. of
  Categories       k
      2          0.798
      3          0.872
      4          0.923
      5          0.949
      6          0.964
      8          0.979
     10          0.986
In our example, the pupils are grouped into 3 categories on each
variable, and k1 = k2 = 0.872. Thus:


C= Је RN КЕЛЕ и. 
2 м 293 855 ** 08724300135 —4 ^ 

= 0-172 8—«0:163. t 


Considerable caution is essential in interpreting contingency,
since strictly it is a measure of divergence rather than of relationship.
Suppose, for instance, that our original table of F's had been:

© 


ç EN, 


БИ» ЕЕЕ ЕС: X 
|66 44:5 ОАО МӘ Зы 
[m 5-9 T. Е 
i Е . 


о 


C and rc would be the same as before, although the relationship is
quite different; for now it is the middle-ownership group which
views most, and the short-ownership group which views least. Thus
contingency coefficients never have a negative sign even though, as
in our example, they happen to express a negative correlation.


Two by Two Tables.—Ex. 45.—Taking the data from Ex. 39,
suppose that we do not know the pupils' actual marks but only the
categories 'top half' and 'bottom half' on the two examinations.¹
The following table may be drawn up:

                                   GEOGRAPHY
                          Top Half         Bottom Half
                          11 or over       10 or under
HISTORY
  Top Half
  15 or over              [illegible]      [illegible]
  Bottom Half
  14 or under             [illegible]      [illegible]

¹ There is no necessity for the proportions to be equal, as they are in the
present instance. But coefficients are more reliable the nearer the split is to the
median.

There are several ways of determining a correlation for such data.
Two that have often been applied are Yule's coefficient of association
(Q), and his coefficient of colligation (ω).

\[ Q = \frac{ad - bc}{ad + bc} \]

where a, b, c, and d are the frequencies in the four cells. Substituting,
we find Q = +0.742. Neither of these coefficients is at all closely
comparable to a product-moment correlation, and contingency is
therefore somewhat preferable. For the same table, χ² = 7.111;
hence C = 0.406, and rc = 0.598. The latter figure compares quite
well with the original product-moment r of +0.579.
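
A sketch of the two-by-two working follows. The cell frequencies 13, 5, 5, 13 are an assumption: they are inferred here from an even split of the 36 pupils on each examination together with the quoted Q and chi-squared, not read from the table. C is taken as the square root of χ²/(N + χ²), which reproduces the quoted 0.406; the function names are illustrative.

```python
import math

def yule_q(a, b, c, d):
    """Yule's coefficient of association for a two-by-two table."""
    return (a * d - b * c) / (a * d + b * c)

def chi_squared_2x2(a, b, c, d):
    """Chi-squared for a two-by-two table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def contingency_c(chi2, n):
    """Contingency coefficient C = sqrt(chi2 / (N + chi2))."""
    return math.sqrt(chi2 / (n + chi2))

# ASSUMED cell frequencies (see the note above)
a, b, c, d = 13, 5, 5, 13
q = yule_q(a, b, c, d)
chi2_2x2 = chi_squared_2x2(a, b, c, d)
c_coeff = contingency_c(chi2_2x2, a + b + c + d)
print(round(q, 3), round(chi2_2x2, 3), round(c_coeff, 3))
```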


A still more suitable technique is known as tetrachoric correlation
(rt). This cannot easily be calculated, but it can be read off very
quickly from specially prepared graphs (cf. Chesire et al., 1933).
Note, however, that rt can be used only when the variables to be
compared (although both expressed in the form of two categories)
can be regarded as normally distributed. It therefore applies to our
examination marks, but could not be used for sex, or eye-colour,
and the like. In the present instance, rt = +0.642. The discrepancy
between this and our r of +0.579 is probably due to the small
number of cases and the irregularities of their distributions.

Biserial r.—When one of the variables to be compared has been
accurately measured and has yielded a normal distribution, but the
other (though believed to be normally distributed) can only be
expressed in the form of two categories, biserial r may be used. For
instance, if we wish to compare intelligence-test scores with success
or failure in a scholarship examination, we may not know the
examination marks, but we do know the intelligence scores of
scholars and non-scholars.
Ex. 46.—Assume that the pupils of Ex. 39 are divided into two
groups according to their history marks, the top 17 and the bottom
19. The mean geography marks of these groups are 12.23 and 7.89
respectively.

\[ r_{bis} = \frac{M_1 - M_2}{\sigma} \times \frac{pq}{z} = \frac{M_1 - M}{\sigma} \times \frac{p}{z} \]

M1 and M2 are the means of the contrasted groups. In the alternative
formula, M is the mean of the total group. σ is the S.D. of the
total group, namely 4.819. p and q are the proportions falling into
the groups, 17/36 and 19/36; z is the ordinate of the normal curve
corresponding to p (obtainable from statistical tables); in this instance
0.3977.

\[ r_{bis} = \frac{12.23 - 7.89}{4.819} \times \frac{0.2490}{0.3977} = +0.563 \]

Though this result agrees well with our product-moment r, r_bis
tends to be somewhat unreliable and easily distorted by irregularities
in the distribution.
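
The biserial working of Ex. 46 may be sketched as below. Instead of looking up z in a table, the sketch computes the ordinate with Python's `statistics.NormalDist`; the function names are illustrative. The result is about +0.564, which agrees with the +0.563 of the text to within the rounding of pq and z.

```python
from statistics import NormalDist

def normal_ordinate(p):
    """Height of the unit normal curve at the point cutting off proportion p."""
    unit = NormalDist()
    return unit.pdf(unit.inv_cdf(p))

def biserial_r(mean_1, mean_2, sd_total, p):
    """r_bis = (M1 - M2) / sigma * (p * q) / z."""
    q = 1.0 - p
    z = normal_ordinate(p)
    return (mean_1 - mean_2) / sd_total * (p * q) / z

# Top 17 and bottom 19 pupils on history; geography means 12.23 and 7.89
z = normal_ordinate(17 / 36)
r_bis = biserial_r(12.23, 7.89, 4.819, 17 / 36)
print(round(z, 3), round(r_bis, 3))
```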
$ ы © ° ? o 

THE PROBABLE ERRORS OF CORRELATION COEFFICIENTS

A correlation coefficient is, like any other numerical psychological
result, liable to vary on account of sampling errors, especially when
the sample of persons, whose scores are inter-correlated, is of small
size. The formula for the P.E. of a product-moment coefficient is:

\[ \mathrm{P.E.}_r = \frac{0.6745(1 - r^2)}{\sqrt{N}} \]

Applying this to our r of +0.579, where N = 36, we get P.E. =
±0.075. It is usual to write the P.E. after the coefficient, thus: r =
+0.579 ± 0.075. The P.E. of a rank-order coefficient is slightly
greater than that of a product-moment one, and is usually calculated
by the formula:

\[ \mathrm{P.E.}_\rho = \frac{0.7063(1 - \rho^2)}{\sqrt{N}} \]

For the determination of the P.E.s or S.E.s of the other coefficients
mentioned above, the reader should consult more advanced textbooks,
such as Kelley (1924) or Guilford (1936).

The connotation of the P.E. of r or ρ is similar to that of an
average or difference (cf. Chap. V). Thus +0.579 ± 0.075 signifies
that if the same history and geography papers were set to much
larger numbers of similar children, and the marks inter-correlated,
the true value of r so obtained would be as likely as not to lie
between +0.654 and +0.504 (i.e. r + 1 × P.E. and r - 1 × P.E.),
but that there is a 1 in 20 probability that it might deviate as widely
as +0.804 or +0.354 or more (3 × P.E.).
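
The P.E. and the limits just described may be sketched as follows; the function name and the `rank_order` switch are illustrative only.

```python
import math

def probable_error_r(r, n, rank_order=False):
    """P.E. of a correlation: 0.6745 (1 - r^2) / sqrt(N),
    with 0.7063 in place of 0.6745 for a rank-order coefficient."""
    factor = 0.7063 if rank_order else 0.6745
    return factor * (1.0 - r * r) / math.sqrt(n)

r, n = 0.579, 36
pe = probable_error_r(r, n)
even_chance_limits = (r - pe, r + pe)          # as likely as not to hold true r
wide_limits = (r - 3 * pe, r + 3 * pe)         # the 3 x P.E. limits
print(round(pe, 3), [round(x, 3) for x in wide_limits])
```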

Just as a difference between averages should be two or more times
its S.E. to be accepted as statistically significant, so a correlation
should be 3 times its P.E. before it is regarded as showing a real
relationship between the inter-correlated variables. A coefficient of,
say, +0.30 ± 0.10 is on the borderline of significance. For if the
true value of r was zero, the scores of a limited sample of persons
might yield +0.30 once in 44 times.

In our previous illustration +0.579 is much greater than 3 × P.E.
Yet, as we have seen, this value is far from trustworthy, the reason
being the small size of N. Most investigators consider a correlation
calculated from less than 25 cases to be almost worthless because
of its low reliability, and they prefer to have at least a hundred cases
before they base any important conclusions on the results of an
experiment involving correlations.


When numbers are small, it is better to calculate t instead of the
P.E.:

\[ t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} \]

For r = 0.30 and N = 36, t = 1.922. Thus this coefficient barely
reaches the 0.05 confidence level.
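
As an illustration, the sketch below applies the usual form of the t ratio for a correlation, r√(N - 2)/√(1 - r²), with N - 2 degrees of freedom, to the history and geography coefficient of +0.579 with N = 36. The function name is illustrative, and the critical value of t must still be looked up in tables.

```python
import math

def t_for_r(r, n):
    """t ratio for testing a correlation against zero, N - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

t_value = t_for_r(0.579, 36)
degrees_of_freedom = 36 - 2
print(round(t_value, 2), degrees_of_freedom)
```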
Fisher's z Method.—The P.E. method of indicating the reliability
of r is inaccurate, not only when N is small, but also when r is large,
exceeding say 0.40. Fisher (1930) presents a better method which is
generally adopted nowadays. It consists in transforming r into a new
quantity called z, whose true S.E. is easily calculated. Applying
this method to our r = +0.579, we find that the limits within which
true r is as likely as not to lie are almost identical with the limits
derived, above, from P.E.r, but that the 0.05 limits should be +0.762
and +0.310 instead of +0.804 and +0.354.
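
The 0.05 limits quoted for r = +0.579 can be reproduced with a short sketch. It assumes the usual definitions, z = ½ log_e((1 + r)/(1 - r)), which is `math.atanh(r)`, and S.E. of z = 1/√(N - 3), and takes 1.96 × S.E. on either side of z before transforming back to r. The function name is illustrative.

```python
import math

def fisher_limits(r, n, multiple=1.96):
    """Limits for true r: transform to z, step out multiple x S.E., transform back."""
    z = math.atanh(r)                   # 0.5 * log((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - multiple * se), math.tanh(z + multiple * se)

low, high = fisher_limits(0.579, 36)
print(round(low, 3), round(high, 3))
```

Unlike the 3 × P.E. limits, these are not symmetrical about r, the upper limit lying closer to the obtained coefficient than the lower.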

Combining and Comparing Correlations.—The z method also provides
a useful way of combining two or more correlations.

Ex. 47.—Suppose that another class of 43 pupils had taken the
history and geography papers and yielded a coefficient of +0.450
± 0.082, what is the best estimate of the true correlation? The
procedure is to determine z1 and z2 for the two r's, and then to compute
the weighted average,

\[ \bar{z} = \frac{z_1(N_1 - 3) + z_2(N_2 - 3)}{N_1 + N_2 - 6} \]

Translating back from the average z gives us our answer,
r = +0.512 ± 0.057.
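
Ex. 47 may be sketched as follows, each z being weighted by N - 3. The function name is illustrative, and `math.atanh` and `math.tanh` stand in for the tables of z.

```python
import math

def combine_correlations(pairs):
    """pairs: (r, N) for each group. Returns the r matching the weighted mean z."""
    weighted_sum = sum(math.atanh(r) * (n - 3) for r, n in pairs)
    total_weight = sum(n - 3 for _, n in pairs)
    return math.tanh(weighted_sum / total_weight)

r_combined = combine_correlations([(0.579, 36), (0.450, 43)])
print(round(r_combined, 3))
```

The sketch gives about 0.511, against the +0.512 obtained from tables.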


b) 


It is often required to find whether a correlation in one group is
appreciably different from that in another group, or whether the
difference can be ascribed to errors of sampling. When the numbers
of cases are large the stock method may be used,

\[ \sigma_{diff} = \sqrt{\mathrm{S.E.}_{r_1}^2 + \mathrm{S.E.}_{r_2}^2} \]

¹ z = ½{log_e(1 + r) - log_e(1 - r)}. The S.E. of z = 1/√(N - 3). Fisher
provides tables of z for various values of r.
Ex. 48.—What is the significance of the difference between
+0.579 ± 0.075 and +0.450 ± 0.082? The S.E.s of these coefficients
are the P.E.s/0.6745, i.e. 0.1112 and 0.1216. σdiff = 0.1645. The
difference is 0.579 - 0.450 = 0.129, which is not as large as its own
σdiff. A difference as large as or larger than this would occur 43% of
times as a mere matter of chance.

As the numbers are small, Fisher's z method should preferably be
used. This consists in computing z1, z2 and their S.E.s, and applying
the stock formula to these S.E.s. The difference in z's is +0.1763
in the above example. S.E.z1 = 0.1741, S.E.z2 = 0.1581.

\[ \sigma_{diff} = \sqrt{0.1741^2 + 0.1581^2} = 0.2352 \]

Here the difference is 0.75 times its σdiff, and it might occur 45% of
times as a mere matter of chance. Note that these methods do not apply
to the difference between two correlations in the same group of people
(cf. Lindquist, 1940).
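
The z version of Ex. 48 may be sketched as below. The two-tailed probability is taken from the normal curve by means of `math.erfc`, which is an assumption about how the percentage in the text was obtained; the function name is illustrative.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Difference between two independent correlations by Fisher's z."""
    diff = math.atanh(r1) - math.atanh(r2)
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    ratio = diff / se_diff
    # chance of a difference at least this large, in either direction
    probability = math.erfc(abs(ratio) / math.sqrt(2.0))
    return diff, se_diff, ratio, probability

diff, se_diff, ratio, probability = compare_correlations(0.579, 36, 0.450, 43)
print(round(diff, 4), round(se_diff, 4), round(ratio, 2), round(probability, 2))
```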
CHAPTER VII


INTERPRETATION AND APPLICATIONS OF
CORRELATIONS


A CORRELATION coefficient is never an end in itself, but always a
means towards the proper interpretation of numerical data. By
studying the correlations between tests of various abilities it is
possible to analyse scientifically the nature of such abilities. This is
one important application, which we shall consider in Chap. VIII.
But the primary value of a correlation is that it enables us to make
predictions. If abilities X1 and X2 are known to be inter-correlated,
then we can use measures of X1 to predict scores on X2, and vice
versa. For instance a pupil's probable scholastic success may be
forecast from his performance on an intelligence test. This cannot,
of course, be done with complete accuracy, but the correlation
coefficient will also tell us how accurate is the forecast, or what are its
limits of error.


“ 


Ex. 49.—The following scattergram shows the Intelligence Quotients
of 440 children and adults on the Stanford-Binet scale, and
their I.Q.s on the Vocabulary test from this scale. (The group or
category 140 includes all testees with I.Q.s around 140, i.e. ranging
from 135½ to 145½; similarly with the other groups.) Here the
correlation is quite a high one, namely +0.799 ± 0.012. Hence we
would be justified in testing people only with the Vocabulary test,
and predicting from their results what would be their Binet I.Q.s.
Let us see how to make such predictions, and how good they
would be.


Suppose a child to obtain a Vocabulary I.Q. of 120, his Binet I.Q.
may, according to the scattergram, lie as high as the 140 class, i.e.
up to 145, or as low as the 100 class, i.e. down to 96. But presumably
its most probable value will be the average of the column:

   X1      F
  140      3
  130     12
  120     16
  110      9
  100      8

This works out at 118.5. Similarly for a child whose Vocabulary I.Q.
is 70, the most likely value of his Binet I.Q. is 75.0, though it may
lie anywhere between 56 and 95.

[Scattergram: X1 = Binet I.Q. (rows) against X2 = Vocabulary I.Q. (columns). The cell entries are illegible; the surviving marginal figures follow.]

  Binet I.Q.      No. of       Average
  class (X1)      cases        X2
     150          [illegible]
     140            12         130.1
     130            27         [illegible]
     120            48         115.6
     110            71         107.6
     100           100          99.3
      90            90          91.2
      80            54          82.4
      70            26          74.0
      60             7          65.1

  Vocabulary I.Q. class (X2):   60     70     80     90    100    110    120    130    140    150
  Average X1:                  69.2   75.0   88.0   91.9   99.9  109.9  118.5  121.6  128.3  138.0

Along the bottom of the scattergram are given the averages of the
columns, i.e. of the X1 measures which most probably correspond
to the X2 measures at the heads of the columns. Fig. 23 is a graph
of these figures. It will be seen that the points on the graph lie
approximately on a straight line which runs through X1 = 100,
X2 = 100. That they do not all lie exactly on the line is due merely
to irregularities in the data. For though the numbers involved in our
example are large, they are still not large enough to yield really
smooth normal distributions, or regularly spaced averages.

Now as we approach the extreme values of X2, the corresponding
values of X1 tend to lag behind; 118.5 is smaller than 120, 138 is
much smaller than 150. Similarly with values below 100, 75 is not so
low as 70, 69 is much less low than 60. This is an important and
characteristic feature of all correlations. It is known as the regression
of X1 towards the mean. The line drawn through the points on the
graph is called a regression line, or a graph showing the regression
of X1 on X2.

[Fig. 23.—Graph showing the most probable values of X1 for testees who obtain
each value of X2.]

Suppose now that the correlation is a perfect one. The scores on
X1 and X2 would be identical, and the slope of the regression line
would be 45°. Next suppose that there is no correlation at all
between X1 and X2. It would then obviously be impossible to predict
scores on one variable from those on the other. If we took successive
columns of the scattergram and averaged them, we should find that
all these means lay close to the mean of X1. Whatever the value of
the Vocabulary I.Q., the average Binet I.Q. would be round about
100. In other words, the regression line would be horizontal. From
these two hypothetical cases we can now see that the nearer the slope
of the regression line is to 45°, the higher the correlation; the nearer
it is to horizontal, the lower the correlation. Furthermore, when the
correlation is perfect, there is no regression towards the mean, since
each X1 value is as big as each X2 value. But when the correlation is
zero, the regression is complete, since whatever the value of X2, the
value of X1 is always the mean. Thus the greater the regression, the
nearer the correlation approaches zero.

It has been assumed in the previous paragraph that the X1 and X2
measures are expressed in equivalent units. If the dispersions of the
two distributions are very different, then the slope of the regression
line corresponding to perfect correlation would not, of course, be
45°. However, as shown in Chap. IV, measures can be turned into
equivalent units if they are expressed as deviations from their means,
and divided by their standard deviations, i.e. as x1/σ1 and x2/σ2. If this
is done the slope of the regression line is identical with r. When the
slope is 45°,

\[ \frac{x_1/\sigma_1}{x_2/\sigma_2} = 1.0, \]

that is perfect correlation. When the slope is horizontal,

\[ \frac{x_1/\sigma_1}{x_2/\sigma_2} = 0.0, \]

that is zero correlation.

52198 . 


The slope of the regression line plotted in Fig. 23 is 0.80. σ1 and
σ2 are 17.49 and 17.84 (i.e. the Vocabulary I.Q.s are slightly more
spread out than the Binet I.Q.s). Hence

\[ r = \frac{0.80 \times 17.84}{17.49} = 0.815 \]

This is very close to the figure already quoted for the correlation, as
determined by the product-moment method. As a matter of historical
fact, r was originally determined by plotting the regression line
and measuring its slope, before statisticians invented the
product-moment method. The latter method is always employed now because,
as we have seen, the regression line obtained by averaging the
columns is somewhat inexact, even when N is large. It is more useful
to reverse the historical procedure, to calculate σ1, σ2, and r, and
from them to determine the regression line. If

\[ r = \frac{x_1 \sigma_2}{x_2 \sigma_1}, \quad \text{then} \quad x_1 = r\,\frac{\sigma_1}{\sigma_2}\,x_2 \]

This is the algebraic equation of the regression line, and
it can be used for predicting any individual's most probable x1 score
from his x2 score. Alternatively, if we wish to predict in terms of
original measures, instead of measures expressed as deviations from
their means, the formula becomes:

\[ (X_1 - M_1) = r\,\frac{\sigma_1}{\sigma_2}\,(X_2 - M_2) \]

Ex. 50.—What is the most probable Binet I.Q. corresponding to
a Vocabulary I.Q. of 140? According to the averages listed at the
foot of the scattergram, the answer is 128.3. But this is based on only
6 cases and is therefore very unreliable. Applying the formula:

\[ X_1 - 100 = 0.799\,(140 - 100) \times \frac{17.49}{17.84} \]

X1 = 131.3 is the correct answer.
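
The regression equation in terms of original measures may be sketched as a small function; the name `predict` and its argument names are illustrative. It is applied here to Ex. 50, and the last line repeats the conversion of the plotted slope into r.

```python
def predict(score, mean_from, sd_from, mean_to, sd_to, r):
    """Most probable score on one variable from a score on the other:
    (X_to - M_to) = r * (sd_to / sd_from) * (X_from - M_from)."""
    return mean_to + r * (sd_to / sd_from) * (score - mean_from)

# Ex. 50: Binet I.Q. (sigma 17.49) predicted from a Vocabulary I.Q. of 140 (sigma 17.84)
binet = predict(140, 100, 17.84, 100, 17.49, 0.799)

# Slope of 0.80 read from the graph, turned into r by allowing for the two S.D.s
r_from_slope = 0.80 * 17.84 / 17.49
print(round(binet, 1), round(r_from_slope, 3))
```

The slope conversion comes to about 0.816 when carried to three places, against the 0.815 of the text.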
Regression of X2 on X1.—So far we have dealt only with the
prediction of X1, knowing X2. The opposite is equally possible. On the
right-hand side of the scattergram in Ex. 49 are listed the average
values of X2 which constitute the best estimates of the Vocabulary
I.Q.s of persons with various Binet I.Q.s. Here, too, there is regression
towards the mean. The probable Vocabulary I.Q. corresponding to a
Binet I.Q. of 120 is 115.6, and for a Binet I.Q. of 70 it is 74.0. A
second graph may therefore be drawn, showing the regression of
X2 on X1. Fig. 24 shows the two lines diagrammatically. If the
correlation was perfect the two lines would coincide, both having a
slope of 45° (provided, always, that X1 and X2 are measured in
equivalent units). If the correlation was zero, the second line would
be vertical, as in Fig. 25, since the only possible estimate of
Vocabulary I.Q. from Binet I.Q. would be 100. With an intermediate
correlation, such as our +0.799, the second line makes the same slope
with the vertical as does the first regression line with the horizontal;
and the two cross at the means of X1 and X2. Thus the algebraic
equation of the second line, which is also the formula for
predicting X2 from X1, will be:

\[ x_2 = r\,\frac{\sigma_2}{\sigma_1}\,x_1, \]

or, in terms of scores,

\[ (X_2 - M_2) = r\,\frac{\sigma_2}{\sigma_1}\,(X_1 - M_1) \]

[Fig. 24.—Regression lines when r = ... (value illegible), showing the regression of X1 on X2 and of X2 on X1.]

[Fig. 25.—Regression lines when r = 0.0.]

Ex. 51.—What is the probable Vocabulary I.Q. of a testee with
Binet I.Q. 80?

\[ X_2 = 100 + 0.799\,(80 - 100) \times \frac{17.84}{17.49} = 83.7 \]

The estimate based on the scattergram was 82.4, which agrees
quite closely with our answer.
Г) а 2 


| .. : 

P.E. of Estimated Scores.—The regression equations will tell us
the most probable estimate of X1 from X2, or of X2 from X1. But it
is obvious from the scattergram that the actual X1 may fall within
quite a wide range on either side of the estimated X1, and that
similarly an estimated X2 is only the average of a wide range of
possible X2's. The ranges may be determined from the formulae:

\[ \mathrm{P.E.}_{estim.X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2} \]

\[ \mathrm{P.E.}_{estim.X_2} = 0.6745\,\sigma_2\sqrt{1 - r^2} \]

Applied to Exs. 50 and 51: P.E.estim.X1 = 0.6745 × 17.49 √(1 - 0.799²)
= ТРЕ i 3 5 


The P.E. or S.E. of an estimated score has a similar meaning to
those already considered, as may be shown by the following example.
Consider a very large group of persons whose Binet I.Q.s are 110,
and whose estimated Vocabulary I.Q.s are 108.1 ± 7.3. Then one
half of them should have Vocabulary I.Q.s lying between 115.4 and
100.8, and all but 1 in 100 should lie within a range of 2.58 × S.E.,
i.e. about 4 × P.E., or between 136 and 80.

Actually we already know the Vocabulary I.Q.s of 71 persons
with Binet I.Q.s round 110. This is not a very large group, but it
should serve to test out our prediction. The scattergram is too coarse,
but looking back to the original scores it appears that 37 of them,
or 52%, had Vocabulary I.Q.s between 115 and 101 (inclusive), and
that the two extreme Vocabulary I.Q.s were 138 and 80. This
agreement is quite close.

It is important to note that the P.E. of an estimated score
decreases as r increases. If the correlation were perfect there would be
no error at all in estimating X1 from X2, or X2 from X1. The lower
the correlation, the less accurate will be the predictions.


4 а 

Ex. 52.—As an additional illustration, take the results from the
36 history and geography papers. Here N is far too small for accurate
regression lines to be obtained by averaging rows or columns. But
by the use of the formulae given above, probable geography marks
can be estimated from history marks, or vice versa. For instance:
what geography mark (X2) corresponds to a history mark (X1) of 7,
and what is its P.E.?

  M1 = 14.139     M2 = 9.945
  σ1 = 3.903      σ2 = 4.819
         r = +0.579

\[ (X_2 - 9.945) = 0.579\,(7 - 14.139) \times \frac{4.819}{3.903} \]

\[ X_2 = 9.945 - 5.103 = 4.84 \]

\[ \mathrm{P.E.}_{estim.X_2} = 0.6745 \times 4.819\sqrt{1 - 0.579^2} = 2.65 \]

Therefore X2 = 4.84 ± 2.65.

Here the correlation is only moderate, hence the estimate is poor
in reliability. All we can claim is that a pupil with a history mark of
7 would be very unlikely to get a geography mark as high as 15
(X2 + 4 × P.E.); and would most probably get round about 2 to 7.
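
Ex. 52 may be sketched by combining the regression equation with the P.E. of the estimate. The function and argument names are illustrative.

```python
import math

def estimate_with_pe(score, mean_from, sd_from, mean_to, sd_to, r):
    """Most probable score on the second variable, with its probable error
    0.6745 * sd_to * sqrt(1 - r^2)."""
    estimate = mean_to + r * (sd_to / sd_from) * (score - mean_from)
    pe = 0.6745 * sd_to * math.sqrt(1.0 - r * r)
    return estimate, pe

# Geography mark (X2) estimated from a history mark (X1) of 7
geography, pe_geography = estimate_with_pe(7, 14.139, 3.903, 9.945, 4.819, 0.579)
likely_range = (geography - pe_geography, geography + pe_geography)
print(round(geography, 2), round(pe_geography, 2))
```

The range of one P.E. on either side of the estimate runs from roughly 2 to 7, as stated in the example.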


INTERPRETATION OF CORRELATIONS

In the light of the above discussion it is possible to give a much
more precise meaning to the conception of correlation than
heretofore. A correlation does not mean, as is sometimes supposed, the
percentage agreement between the correlated variables. In the
scattergram on p. 113, for example, only 37½% of the testees get the
same I.Q., or rather the same class of I.Q., on Binet and Vocabulary,
although the correlation is +0.799. It is possible to interpret a
coefficient in terms of the number of factors or elements which are
common to the two variables, which therefore cause them to
correlate with one another, but there is little practical point in so doing.
What a correlation does indicate is the amount of reduction of error
in predicting scores on one variable from scores on the other variable.
When r is zero, the error is at its maximum, namely 0.6745σ.
Predictions based on X2 might fall anywhere within the whole range of
X1 scores, and would be no more accurate than drawing X1 scores
out of a hat. With a correlation of 1.0 the error is reduced to zero.
The extent of reduction of error for intermediate r's is shown in the
following table. The second column lists values of 100 (1 - √(1 - r²)),
and so represents the reduction in percentage terms. These figures
are sometimes referred to as the forecasting efficiencies of the r's. It
will be seen that the forecasting efficiency only becomes high enough
to be of great practical value when r is in the neighbourhood of 0.8
to 0.9 or more. Great caution should be exercised therefore before
predicting, say, success at a job from aptitude tests, since these
usually give correlations of only +0.4 to +0.6 with what we wish
to predict. Such deductions are only 8% to 20% more accurate than
pure chance guesses.

     r       Reduction of error, or
             forecasting efficiency %
    0.00           0.0
    0.10           0.5
    0.20           2.0
    0.30           4.6
    0.40           8.4
    0.50          13.4
    0.60          20.0
    0.70          28.6
    0.80          40.0
    0.90          56.4
    0.95          68.8
    1.00         100.0
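
The second column of the table can be regenerated from the expression 100 (1 - √(1 - r²)); the function name below is illustrative.

```python
import math

def forecasting_efficiency(r):
    """Percentage reduction in the error of prediction for a correlation r."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))

levels = (0.00, 0.10, 0.20, 0.30, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00)
for r in levels:
    print(f"{r:4.2f}   {forecasting_efficiency(r):6.1f}")
```

The slow rise of these figures for small r is the point of the paragraph: doubling r from 0.4 to 0.8 raises the efficiency from roughly 8% to 40%.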


An alternative method of interpretation is suggested by McClelland
(1942). Suppose we are selecting the top 20 of a group of 100
candidates, either for secondary schooling or for a job, on the basis of
their intelligence test or selection battery scores. We know that the
test gives a correlation of r with later success. What proportion of
candidates is correctly allocated, or how many do we wrongly select
and wrongly reject? The position for various levels of r is shown in
the following tables.

                   r = 1.00               r = 0.00               r = 0.85
              Success- Unsuc-        Success- Unsuc-        Success- Unsuc-
              ful      cessful       ful      cessful       ful      cessful
              later    later  Total  later    later  Total  later    later  Total
Pass test       20       0      20      4       16     20     14        6     20
Fail test        0      80      80     16       64     80      6       74     80
Total           20      80     100     20       80    100     20       80    100
In the first table all the best 20 on the test turn out well and the
bottom 80 badly; correlation is perfect. In the second, r is zero and
our selection is no better than picking names out of a hat, since as
large a proportion of the 20 who pass do well later as of the 80 who
fail the test (namely 20% of each). Nevertheless, 64 and 4 are
correctly allocated, and the proportion of errors is 32%. The third
table represents the typical situation in secondary school selection,
where the battery of intelligence and attainments tests correlates
about 0.85 with later school work. The number of correct allocations is
now 88, and of errors 12%. (Note, however, that nearly one-third of the
20 passes who actually get into the grammar school gain unsatisfactory
marks later.) Thus we have reduced errors by
$\frac{32 - 12}{32} = 62\tfrac{1}{2}\%$. Here are the corresponding
reductions for other levels of r.

      r        Reduction of errors
               in prediction %
     0.00              0
     0.10              6
     0.20             12
     0.30          [illegible]
     0.40             23
     0.50             30
     0.60             37
     0.70             46
     0.80             57
     0.90             70
     0.95             80
     1.00            100
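
The arithmetic behind McClelland's tables can be set out as a short calculation. The sketch below is in Python and assumes, as in the tables, that the number of candidates who pass the test equals the number who later succeed (20 out of 100). The function names are illustrative only, and the cell counts must still be supplied from a table: the code does not derive them from r.

```python
def allocation_errors(both: int, selected: int = 20) -> int:
    """Candidates wrongly selected plus candidates wrongly rejected.

    `both` is the number who pass the test AND succeed later. Since as many
    succeed as are selected, every wrong selection implies one wrong rejection.
    """
    if not 0 <= both <= selected:
        raise ValueError("both must lie between 0 and selected")
    return 2 * (selected - both)


def reduction_of_errors(both: int, selected: int = 20, total: int = 100) -> float:
    """Percentage by which errors fall below those of pure chance selection."""
    chance_both = selected * selected / total       # 4 of 100 when r = 0
    chance_errors = 2 * (selected - chance_both)    # 32 of 100 when r = 0
    errors = allocation_errors(both, selected)
    return 100.0 * (chance_errors - errors) / chance_errors


if __name__ == "__main__":
    for both in (4, 14, 20):    # the r = 0.00, 0.85 and 1.00 tables
        print(both, allocation_errors(both), reduction_of_errors(both))
```

With 14 candidates in the pass-and-succeed cell, the sketch gives 12 errors and a reduction of 62.5%, agreeing with the figures worked in the text.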


The Influence upon Correlations of Irrelevant Common Factors.—
A positive correlation which is statistically significant (i.e. 3 or
preferably 4 times its P.E.) always indicates some kind of causal
connection, or some common factor or factors running through the
two variables, even when the connection is not strong enough for us
to be able to forecast one variable from the other. Considerable care
is needed, however, in interpreting such a connection, for its real
nature may be quite unsuspected, or the common factors may be of
an entirely irrelevant kind. Suppose, for example, that we correlated
the size of boots worn by all the children in a school with the speed
of their handwriting; we should undoubtedly obtain a moderately
high coefficient. Yet it would be absurd to say that big boots cause
quick handwriting, or vice versa. The real reason for the correlation
is an irrelevant common factor, in this case age. The older children
on the whole write more quickly and have larger feet than the
younger ones. If we took children all of the same age and compared
boots with handwriting, we should be likely to find that the
correlation had fallen to zero.


Consider another, rather more complex, illustration. A positive
correlation exists between backwardness in school work and
unemployment among the children's parents. Idealists may at once
jump to the conclusion that unemployment brings about malnutrition
and anxiety among the children, and that these make their work
fall off. But here also there is an important common element which
goes some way to explain the connection. Unemployed parents are
known to be on the whole less intelligent than employed ones, and
unintelligent parents tend not only to have unintelligent children,
but also to give them less educational encouragement than do
intelligent parents. These factors, low intelligence and a generally
unfavourable environment, are far more important causes of
backwardness than is malnutrition. We need to take two groups of
children, one with parents employed, one unemployed, groups whose
average intelligence level is identical, and then study their school
work to see whether the former do better than the latter. Only when
we have ‘held the intelligence factor constant’ in this example, or
‘held the age factor constant’ in the previous example, can we
legitimately deduce some causal connection.

Partial Correlation.—The above type of problem or difficulty in
interpretation is constantly occurring in psychology and education.
Sometimes it can be met by applying what is known as the partial
correlation technique. For this it is necessary to obtain measures of
the suspected irrelevant factor, which we shall call $X_3$, and to work
out the correlations between $X_1$ and $X_3$ ($r_{13}$), and between $X_2$
and $X_3$ ($r_{23}$), as well as between $X_1$ and $X_2$ ($r_{12}$).¹ We may
then ‘hold $X_3$ constant’ or ‘partial it out’ by the following formula,
in which $r_{12.3}$ means the correlation of $X_1$ with $X_2$ when $X_3$ is
eliminated:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

¹ Thouless (1939a) has pointed out that the partial correlation formula can
only legitimately be used if $X_3$ can be measured without error. It is thus most
appropriate for holding age constant, but not for, say, intelligence, since
measurements of the latter always involve some error. However, Thouless
provides an alternative formula which may be used when the reliability of $X_3$ is
known.

Ex. 53.—Suppose the correlation between boots and handwriting
speed to be +0.45, that between boots and age +0.70, and between
handwriting and age +0.60. Then the correlation between boots
and handwriting when the influence of age is removed is:

$$r_{12.3} = \frac{0.45 - 0.70 \times 0.60}{\sqrt{1 - 0.70^2}\,\sqrt{1 - 0.60^2}} = +0.052$$
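
The boots-and-handwriting example can be checked numerically. The sketch below, in Python, implements the standard first-order partial correlation formula; the function name is illustrative, and the code assumes (following the Thouless caution in the footnote) that the third variable is measured without error.

```python
import math


def partial_correlation(r12: float, r13: float, r23: float) -> float:
    """Correlation of X1 with X2 when X3 is held constant."""
    denominator = math.sqrt(1.0 - r13 ** 2) * math.sqrt(1.0 - r23 ** 2)
    if denominator == 0.0:
        raise ValueError("X3 correlates perfectly with X1 or X2")
    return (r12 - r13 * r23) / denominator


if __name__ == "__main__":
    # Boots (X1), handwriting speed (X2), age (X3), as in Ex. 53.
    print(round(partial_correlation(0.45, 0.70, 0.60), 3))
```

A raw coefficient of +0.45 shrinks almost to zero once age is partialled out, which is the numerical form of the argument that age, not boot size, accounts for the relationship.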

The Effect of Homogeneity and Heterogeneity upon Correlations.—
We have just seen that when pupils are heterogeneous as regards age,
the correlation between a pair of their characteristics may be
spuriously raised. The same is true of other types of heterogeneity.
Suppose we calculate the correlation between two tests in an ordinary
school class, and then eliminate from consideration most of the
pupils of medium ability on both tests, and re-calculate the
correlation; the second figure will certainly be larger than the first.
The distributions will not, of course, be normal; the diversity or
heterogeneity of the pupils will be unduly large, and this will increase
the size of r. In ordinary practice the reverse, namely undue
homogeneity, is more likely to occur without the investigator realizing
it. Thus the correlations between intelligence tests and school work in
an ordinary school class are usually lower than they would be in an
unselected group of children of the same age. For such a school class is
more homogeneous as regards intelligence and school work than is the
population as a whole. The dullest children of that age are likely to
be at a special school, and many of the brightest ones may be at
private schools. Take an extreme case: if we selected a group of
children all of precisely the same intelligence, any correlation
between intelligence and work would disappear. One important reason
why intelligence tests appear to be of much less value for predicting
scholastic aptitude among secondary grammar than among primary
pupils, and to be poorer still among university students, is that
secondary pupils are more homogeneous or more highly selected
than primary. Most of the children of I.Q. less than 110 are weeded
out before the grammar school is reached, and university entrance
requirements remove many more from the middle and lower end of the
scale. Thus correlations necessarily sink as we pass from unselected
children to the primary school, from the primary to the grammar,
and from the grammar to the university level. The typical correlation
of +0.85 which we have quoted for selection tests at 11+ with
later school work drops to about +0.45 within a grammar school
group.

It is not easy to decide just what degree of heterogeneity should be
regarded as normal and reasonable, nor to specify the extent to
which any given group of persons exceeds or falls short of this degree
of heterogeneity. Hence the correction of correlations either for
undue homogeneity or excessive heterogeneity is complicated. The
reader may be referred to discussions by Thomson (1939) and
Thorndike (1949).

The main conclusion to be drawn from the above sections is that
the investigator should always scrutinize his correlational data very
carefully, first in order to see whether any irrelevant common factors
may be increasing spuriously the size of his coefficients, secondly to
judge whether or not the diversity of scores in the group tested is
unusually wide or restricted. He should realize that the absolute size
of a correlation means very little, since this size may be so much
altered by uncontrolled common factors or heterogeneity. His
interpretation of the amount of relationship between the correlated
variables should be made relative to the particular data with which
he is working.

In contrast to the absolute size, the predictive value or forecasting
efficiency of a correlation is not affected by heterogeneity. For the
latter, as we have seen, depends on $\sigma\sqrt{1 - r^2}$. Hence an
increase in r on account of excessive heterogeneity is offset by the
increase in $\sigma$, and the P.E. remains the same. It should be noted
further that forecasting efficiency is not in the least upset by
irrelevant common factors. True, it might be foolish to predict speed of
handwriting from the size of boots, yet the correlation obtained between
these variables would provide a valid means of making such predictions.


COMBINING TWO OR MORE TESTS

The educational statistician is frequently faced with the problem
of combining the results of several examinations and tests, and using
the total scores for the selection of the best pupils. We have already
dealt rather fully with the point that the distributions to be combined
should be expressed in equivalent units, i.e. that they should possess
the same means and S.D.s, except when it is desired to give more
weight to one set of marks than to another, in which case the former
should have the larger S.D., not necessarily the larger mean. Here
we must draw attention to an important fact which markers seldom
realize, namely that the S.D. or dispersion of combined variables is
always less than that of the separate variables by an amount depending
on the lowness of correlation between these variables. This fact
is a natural corollary of the facts of regression outlined above.

Suppose, for example, that total marks are based on the average
of six examinations, each with a mean mark of 60, a range of 30 to
90, and a S.D. of 10. If the average correlation between the six
variables is +0.70, the final distribution will probably have a S.D.
of 8.66 and the average totals will range from 34 to 86. But if the
average inter-correlation is only +0.30, the final distribution will
have a S.D. of 6.45, and the range will drop from 30-90 to 41-79. It
is easy to see why this must be so. When the average inter-correlation
is moderate or low, no pupil or student will get very high marks, or
very low ones, on all the examinations. Hence the highest and lowest
average marks will be distinctly nearer to the mean than the highest
and lowest marks on each separate examination. The same phenomenon
occurs when marks on different questions in a single examination
are totalled. For instance, if five questions are marked, each
with a range of 5-20 out of 20, the examination totals will probably
range from about 40-85, and not from 25-100.

A chief examiner, head teacher, or other marking authority should
make allowance for these facts. If he wishes the distribution of final
totals to show a certain proportion of very high (over 80%) marks,
and a certain proportion of low (under 40%) marks, then he must
either direct the markers of the component examinations to employ
extra-wide distributions (i.e. to award much larger proportions of
marks over 80% and under 40%), or else he must re-scale the final
marks so as to spread them out again into a distribution with the
required dispersion. The formula connecting $\sigma$, the S.D. of the
component distributions, and $\sigma_{av}$, the S.D. of these distributions
averaged, is $\sigma_{av} = \sigma\sqrt{\dfrac{1 + (n - 1)R}{n}}$, where R is
the average correlation between all the sets of marks, and n the number
of sets.¹
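
The shrinkage of the S.D. when marks are averaged can be worked out with the formula $\sigma_{av} = \sigma\sqrt{(1 + (n - 1)R)/n}$. The sketch below is in Python; it assumes that every component has the same S.D. and that one average inter-correlation R describes all pairs. The function names are illustrative.

```python
import math


def sd_of_average(sd: float, n: int, mean_r: float) -> float:
    """S.D. of the average of n sets of marks, each with S.D. `sd`."""
    return sd * math.sqrt((1.0 + (n - 1) * mean_r) / n)


def mean_intercorrelation(sd: float, n: int, sd_avg: float) -> float:
    """The same formula turned around: average R from the two S.D.s."""
    return (n * (sd_avg / sd) ** 2 - 1.0) / (n - 1)


if __name__ == "__main__":
    # Six examinations, each with S.D. 10, as in the example in the text.
    for mean_r in (0.70, 0.30):
        print(mean_r, round(sd_of_average(10.0, 6, mean_r), 2))
```

With R = 1 the averaged marks keep the full S.D. of 10; any lower average inter-correlation pulls the totals in towards the mean, which is why an examiner who wants a given spread of final marks must either widen the component distributions or re-scale afterwards.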

¹ Alternatively, $\sigma_{sum}/n$ may be substituted for $\sigma_{av}$, where
$\sigma_{sum}$ is the S.D. of the total scores. The same formula, turned
around, is often useful for finding the average inter-correlation of n
variables when their separate and combined S.D.s are known. Cf. Kelley
(1924), Formula No. 171.

Multiple Correlation.—If two tests or examinations each correlate
moderately with a third, then the two combined will usually correlate
higher with the third than either separately. For example, English
and arithmetic examinations at 11+ usually yield correlations of
about +0.40 with subsequent achievement in grammar schools. The
combined mark is likely to correlate about +0.45 with such
achievement. Every educationist has realized and applied this principle.
With the aid of statistical methods we can formulate it more precisely.

Let there be n tests which give correlations $r_{1A}$, $r_{2A}$, $r_{3A}$,
etc., with some measure of achievement, A, the sum of all these
coefficients being $\sum r_A$; and let the correlations of the tests with
one another be $r_{12}$, $r_{13}$, $r_{23}$, etc., their sum being
$\sum r_{ij}$. Then the correlation of A with the combined tests is:

$$\frac{\sum r_A}{\sqrt{n + 2\sum r_{ij}}}$$

Ex. 54.—Three different types of intelligence tests gave correlations
of +0.483, +0.456, and +0.337 with students’ marks in psychology.
$\sum r_A = 1.276$. Their inter-correlations were +0.507, +0.468, and
+0.508. $\sum r_{ij} = 1.483$. What will be the correlation of the
combined tests with psychology marks?

$$\frac{1.276}{\sqrt{3 + 2 \times 1.483}} = +0.522$$
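
Ex. 54 can be reproduced with a few lines of Python. The sketch assumes that the tests are simply added with equal weight after being put into equivalent units (equal S.D.s), which is the condition under which the expression $\sum r_A / \sqrt{n + 2\sum r_{ij}}$ holds. The function name is illustrative.

```python
import math


def correlation_with_sum(criterion_rs, inter_rs):
    """Correlation of a criterion A with the unweighted sum of n tests.

    criterion_rs: the n correlations of each test with A.
    inter_rs:     the n(n-1)/2 correlations among the tests themselves.
    """
    n = len(criterion_rs)
    if len(inter_rs) != n * (n - 1) // 2:
        raise ValueError("expected n(n-1)/2 inter-correlations")
    return sum(criterion_rs) / math.sqrt(n + 2.0 * sum(inter_rs))


if __name__ == "__main__":
    r = correlation_with_sum([0.483, 0.456, 0.337], [0.507, 0.468, 0.508])
    print(round(r, 3))
```

The combined figure is only a little above the best single test, because the three tests overlap heavily with one another; lower inter-correlations in the second list would raise it.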


The above method is a simple but crude way of combining tests
so as to give better predictions of achievement. Naturally some of
the tests are superior to others, and so should receive more weight
in the total score. Further, the most useful tests will be the ones
which, as well as correlating highly with A, do not correlate highly
with the other tests that are being employed. For obviously, if tests 1
and 2 are measuring almost precisely the same thing, they will predict,
as it were, the same aspect of A, whereas what we want are tests which
predict different aspects, and so between them cover the whole of A
more thoroughly. Thus it may be better to leave out test 2, and
substitute another which, while perhaps correlating less well with A,
has the advantage of being almost independent of test 1. The method
known as multiple correlation enables the educationist or psychologist
to choose from a number of tests those which will in combination
give the highest possible correlation with A, and to weight them
suitably. Multiple regression equations can also be calculated which
predict an individual's most probable score on A by combining his
marks on the separate tests in appropriate proportions. The method
is too elaborate to be given here, but the above discussion will, it is
hoped, indicate its object and main principles.

The vocational psychologist applies an analogous procedure in
selecting the best candidates for a certain occupation; that is, he
gives a series of tests each of which has been found to correlate
moderately with success at the occupation, and combines them by
the multiple correlation method so as to obtain as accurate
predictions as possible.


TEST RELIABILITY

If a test or examination is applied a second time under similar
conditions, and the testees’ scores differ widely from those previously
obtained, the test is obviously a poor one. It is said to be reliable
only if the two sets of scores correlate highly with one another.
Further, if different testers apply and score a test or examination,
they should arrive at the same, or nearly the same, scores. Repetition
of a test may, however, give an unfair picture of its reliability, since
the testees may remember their previous responses. Many tests are
therefore supplied in two or more parallel forms, so that if a re-test
is desired, different questions may be set which should, nevertheless,
yield much the same results as did the questions of the first test.
When no alternative form is available, a single test is often split into
two equivalent halves. For instance, the scores on odd- and on even-
numbered questions or items may be totalled separately, and then
inter-correlated. But it is known that test reliability depends on the
length of the test; hence the correlation between one half and the
other half is unduly low, and is usually corrected by the formula:
$R = \dfrac{2r}{1 + r}$. Here r is the obtained coefficient, and R the
coefficient to be expected had it been possible to compare the whole of
the test with another similar test.
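
The split-half procedure just described may be sketched in Python as follows. The data layout (one list of item marks per testee) and the function names are assumptions made for illustration; the correction applied is the doubling formula $R = 2r/(1 + r)$.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def step_up_doubled(r: float) -> float:
    """Reliability of the whole test predicted from the half-test correlation."""
    return 2.0 * r / (1.0 + r)


def split_half_reliability(item_scores):
    """Return (half-test r, corrected whole-test R) from per-testee item marks."""
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r = pearson(odd, even)
    return r, step_up_doubled(r)
```

For instance, a half-test correlation of 0.60 steps up to a whole-test reliability of 0.75 under this formula.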

In all these instances (namely repetition, application and scoring
by a different tester, parallel form, and corrected split-half) we
generally expect to get an inter-correlation of at least +0.90. For if
this reliability coefficient is much lower it would indicate that the
scores are too unstable to be trusted. Standardized educational and
intelligence tests generally reach this level of reliability, but ordinary
examinations often fail to do so, as we shall see in Chap. XI. Many
performance tests, tests of occupational abilities, and character or
temperament tests, are also often poor in reliability. However, by
increasing the length of a test, or by combining several parallel tests,
a more satisfactory coefficient may be attained.

Knowing the reliability coefficient, it is possible to calculate the
reliability, or the limits of variation, of individual scores, by the
formula already given: $\text{P.E.}_{est.\,X_2} = 0.6745\,\sigma_2\sqrt{1 - r^2}$.
For example, if parallel intelligence tests give a correlation of +0.93,
and the S.D. of their I.Q.s is 15, then
$\text{P.E.}_{I.Q.} = 0.6745 \times 15\sqrt{1 - 0.93^2} = 3.71$. But if the
reliability coefficient is only +0.70, $\text{P.E.}_{I.Q.} = 7.21$, that is
almost twice as large. Such a P.E. is, as we have already seen, an index
of the extent to which scores on the second test may vary when predicted
from scores on the first test. About half the testees will obtain I.Q.s
on the second test within 4 points (if r = 0.93), or about 7 points
(if r = 0.70), of their I.Q.s on the first test. But rare cases
(1 in 100) may alter by as much as 4 × P.E., i.e. by some 15 and 29 or
more points respectively.

The application of two parallel forms of a test to the same
individual yields, as we have seen, somewhat different results. If we
possessed a large number of additional parallel forms and applied
them, still other results would be obtained (quite apart from effects
of practice). Such results for one individual would tend to conform to
a normal distribution, and their grand average would, of course, be
the best possible measure of the individual's standing on the test,
what may be called his true score. The situation is exactly analogous
to that which we discussed in connection with group differences
(Chap. V), except that there we were concerned with variations
among the means of sample groups, here with variations among an
individual’s scores on sample tests. An individual’s score therefore
possesses another P.E., which indicates its liability to deviate from
his true score. The formula for this also involves the reliability of
the test, and the S.D. of the test scores of an unselected group of
persons: $\text{P.E.} = 0.6745\,\sigma\sqrt{1 - r}$. Thus the P.E.s of I.Q.s
for tests with reliability coefficients of +0.93 and +0.70 are 2.67 and
5.53 respectively. Note that these P.E.s are smaller than those quoted
above, since $\sqrt{1 - r}$ is necessarily smaller than $\sqrt{1 - r^2}$.
This is to be expected, since the former P.E. represents the deviation
of one sample score from another sample, whereas the latter represents
the deviation from the mean of all the possible samples.
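
The two probable errors discussed above differ only in the term under the square root, and the contrast is easy to tabulate. The following Python sketch uses illustrative function names; its results should agree with the figures in the text to within rounding of the intermediate square roots.

```python
import math

PE_FACTOR = 0.6745   # probable error expressed as a fraction of the S.D.


def pe_of_estimate(sd: float, r: float) -> float:
    """P.E. of a score on one form predicted from a score on a parallel form."""
    return PE_FACTOR * sd * math.sqrt(1.0 - r * r)


def pe_of_obtained_score(sd: float, r: float) -> float:
    """P.E. of an obtained score about the individual's true score."""
    return PE_FACTOR * sd * math.sqrt(1.0 - r)


if __name__ == "__main__":
    for r in (0.93, 0.70):
        print(r, round(pe_of_estimate(15.0, r), 2),
              round(pe_of_obtained_score(15.0, r), 2))
```

For any reliability between 0 and 1 the second quantity is the smaller, since $1 - r < 1 - r^2$ in that range.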

Factors Influencing Test Reliability.—Since the reliability of a test
is almost synonymous with its thoroughness, it can be increased or
lowered almost indefinitely, by lengthening or shortening the test.
The formula for predicting the reliability of a test whose length is
doubled has already been cited. When it is lengthened n times, the
formula becomes: $R = \dfrac{nr}{1 + (n - 1)r}$. This is known as the
Spearman-Brown prophecy formula.

Ex. 55.—A short educational test has a reliability coefficient of
+0.60. How much must it be lengthened to attain a coefficient of
+0.90? Turning the above formula round, we get:

$$n = \frac{R(1 - r)}{r(1 - R)} = \frac{0.90 \times 0.40}{0.60 \times 0.10} = 6$$

It should be six times as long.

The same formula would apply to the following example.
If two examiners mark a set of examination scripts, and their marks
inter-correlate +0.60, how many examiners should mark the scripts
for their combined marking to attain a reasonable reliability of
+0.90? The answer is six.

The formula assumes that the correlations between the additional
tests (or examiners) would be the same as between the first pair of
tests (or examiners). This condition seems generally to be satisfied
by educational measurements.
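
Both directions of the Spearman-Brown prophecy formula can be sketched in Python as below. The function names are illustrative, and the code assumes, as the text does, that the added material correlates with the original to the same degree as the original parts correlate with one another.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test of reliability r is lengthened n times."""
    return n * r / (1.0 + (n - 1.0) * r)


def lengthening_needed(r: float, target: float) -> float:
    """How many times longer the test must be to reach the target reliability."""
    if not (0.0 < r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("r and target must lie strictly between 0 and 1")
    return target * (1.0 - r) / (r * (1.0 - target))


if __name__ == "__main__":
    print(lengthening_needed(0.60, 0.90))   # Ex. 55: about six times as long
    print(spearman_brown(0.60, 6))          # and back again: about 0.90
```

Setting n = 2 in `spearman_brown` gives the same result as the split-half correction for a doubled test.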

A second important factor is the heterogeneity of the group whose
scores provide the reliability coefficient. Parallel forms of an
intelligence test will inter-correlate much more highly if applied to
school-children of several years’ age range than when applied to a
single class or to a single year-group (i.e. children whose ages range
over only one year). Coefficients obtained from the latter groups are
conventionally accepted as fairer indices of reliability than
coefficients from the former. If we know $\sigma$, the S.D. of test scores
in a group which is unduly homogeneous or heterogeneous, and r, the
reliability obtained in that group, then it is possible to predict R,
the reliability to be expected in a group with S.D. $\Sigma$, by Kelley's
formula: $R = 1 - \dfrac{\sigma^2 (1 - r)}{\Sigma^2}$.

Ex. 56.—The reliability coefficient of an intelligence test when
applied to a class of primary school-children was +0.88. The S.D.
of their I.Q.s was 14. What would be its reliability in a secondary
school class whose I.Q.s have a S.D. of only 9?

$$R = 1 - \frac{14^2}{9^2} \times 0.12 = +0.71$$

Note that the P.E. of an individual's score is not upset by this
factor, as is the reliability coefficient. In the above example it works
out at ± 3.27 in either class.
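
Ex. 56, together with the remark that the P.E. of an individual's score is unchanged, can be verified with a short Python sketch. The function names are illustrative; the formula is Kelley's as given above, $R = 1 - \sigma^2(1 - r)/\Sigma^2$.

```python
import math


def reliability_in_other_group(r: float, sd_obtained: float, sd_new: float) -> float:
    """Reliability expected in a group whose S.D. is sd_new (Kelley's formula)."""
    return 1.0 - (sd_obtained ** 2) * (1.0 - r) / (sd_new ** 2)


def pe_of_obtained_score(sd: float, r: float) -> float:
    """P.E. of an individual's score about his true score."""
    return 0.6745 * sd * math.sqrt(1.0 - r)


if __name__ == "__main__":
    big_r = reliability_in_other_group(0.88, 14.0, 9.0)
    print(round(big_r, 2))
    print(round(pe_of_obtained_score(14.0, 0.88), 2),
          round(pe_of_obtained_score(9.0, big_r), 2))
```

The two probable errors coincide because the formula is built on the assumption that $\sigma^2(1 - r)$, the error variance, is the same in both groups.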

The assessment of test reliability and its implications are complex
topics, and the account given here covers only the more elementary
points. Advanced students are advised to consult Thorndike (1949)
and Gulliksen (1950).


Function Fluctuation.—Most reliability coefficients fall below 1.0
for two distinct reasons: first, through the test being insufficiently
thorough (i.e. liable to errors of sampling), and, secondly, because the
persons tested or examined may alter in their level of achievement
between one test and another. The latter factor has been termed
‘individual variance’ or ‘function fluctuation’ (Thouless, 1936).
Naturally it has a greater effect on coefficients obtained from parallel
tests given on different occasions, or from repeated tests, than it
does on coefficients obtained by the split-half method. Hence the
difference between these two types of coefficient affords a means of
distinguishing between the reliability of the test itself, and the
tendency of testees to fluctuate in respect of the ability tested.
Thouless provides a formula for estimating function fluctuation from
this difference.

Effects of Test Reliability on Test Inter-correlations.—A test which
is not perfectly reliable may be regarded as measuring so much of the
function which a perfect test would measure, and so much error.
The error part of it is quite useless, and will have the effect of
reducing the correlation between this test and any other test. If
both tests possess considerable errors through lack of reliability,
their inter-correlation will be considerably reduced. In fact, the
maximum possible value of such an inter-correlation will be, not 1.0,
but the square root of the product of their reliability coefficients.
Suppose we find a correlation of +0.50 between a selection
examination and subsequent achievement, and that the reliabilities of
the examination and the measure of achievement are only +0.70
and +0.80. Then the true correlation, if it were possible to obtain
perfectly reliable exams and measures of achievement, would be
$\dfrac{+0.50}{\sqrt{0.70 \times 0.80}} = +0.67$.
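
The correction worked in this example (Spearman's correction for attenuation) divides the obtained coefficient by the square root of the product of the two reliabilities. The following Python sketch uses illustrative function names and also returns the ceiling that imperfect reliability sets on an obtained correlation.

```python
import math


def correlation_ceiling(rel_x: float, rel_y: float) -> float:
    """Largest correlation obtainable between two imperfectly reliable measures."""
    return math.sqrt(rel_x * rel_y)


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimated correlation had both measures been perfectly reliable."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / correlation_ceiling(rel_x, rel_y)


if __name__ == "__main__":
    print(round(correlation_ceiling(0.70, 0.80), 2))
    print(round(correct_for_attenuation(0.50, 0.70, 0.80), 2))
```

The corrected value describes hypothetical error-free measures; for actual prediction it is the obtained, uncorrected coefficient that applies.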
Spearman has called this reduction effect of imperfect reliability
‘attenuation,’ and has provided formulae by which it may be corrected
or compensated. Such correction is, of course, of theoretical
interest, since it tells us what correlations to expect between various
psychological abilities if they could be accurately measured. But it is
of little practical value, since completely accurate tests are
unattainable, and for all purposes of prediction we have to employ the
obtained, uncorrected, correlation coefficients (cf. Thouless, 1939a).


ANALYSIS OF ABILITIES


Fallacious Views of Mental Organization.—Correlational investigations
have assumed especial importance in recent years, since they
enable us to clarify our conceptions of human abilities, and of the
organization or structure of the mind. The layman's notions as to
what abilities exist, and what each ability includes or excludes, are
extremely loose, and the same was true of psychological theory
throughout most of last century. Teachers and parents are heard to
remark: “Johnny is poor at book-learning, but he makes up for it
by his cleverness with his hands. He will be a good mechanic when
he grows up.” “Mary has an excellent memory and good power of
attention.” “Willie is slow but sure,” and so on. Such statements are
now known, as a result of experimental investigation, to be largely
fallacious. Only exact study can tell us how far the slow person is
likely to be sure, whether manual ability in a boy has any predictive
value for mechanical ability in a man, whether there is any such
entity in the mind as memory. Again, the so-called faculty school of
psychologists of a hundred years ago considered the mind to be
made up of a number of distinct powers or faculties such as reasoning,
judgment, imagination, etc. Indeed, it was at one time regarded
as the main object of education to train each of these faculties by
providing children with appropriate ‘mental gymnastics.’ As soon,
however, as scientific psychologists tried to define the faculties
precisely and measure them, and to determine the effects upon them of
various types of training, they fell into discredit. A priori analysis has
not even been able to decide what is the true nature of intelligence.
The accounts of it given by different writers have been extremely
varied and discrepant. Nowadays, therefore, such problems, both of
theory and practice, are approached by means of experimental
studies of correlations between tests. For they all eventually resolve
into the question: what correlates with what?
The Scientific Definition of an Ability.—First we must define what
we mean by an ability, capacity, or faculty. It implies the existence
of a group or category of performances which correlate highly with
one another, and which are relatively distinct from (i.e. give low
correlations with) other performances. Take, for example, mechanical
ability. Some people are better than others at tasks involving
manipulation of mechanisms, and this ability is fairly consistent or
general in the sense that those who are good at one such task are also
usually good at others. The consistency is not perfect. A may be
especially good with meccano, less clever with locks, B the opposite,
C the best with electrical fittings, and so on. But if there was no
appreciable correlation between these and other similar performances,
if they were all found to be specific, we should not be entitled
to regard mechanical ability as a real entity. Another important
condition is that the performances should be fairly reliable. We do
not expect perfect stability, but if people fluctuated wildly in their
success at mechanical tasks from day to day, we should hardly
recognize the existence of mechanical ability. There is a further
possibility, namely, that the various performances may inter-correlate
well, but that the correlations may be accounted for by some other
common factor such as general intelligence (the age factor, we will
assume, has already been eliminated). Mechanical ability would not
then be anything distinctive. However, intelligence tests can be
applied to the same persons who take the mechanical tests, and
intelligence can be partialled out or held constant. It is then found
that the mechanical tests still overlap, or show positive correlations
with one another over and above their correlations due to the
intelligence factor. We are, therefore, able to accept mechanical
ability as something consistent and distinctive. Tests of it do
correlate sufficiently well and reliably for us to postulate some common
element running through them.

A common element such as this is often referred to as a group
factor, since it occurs in a group of performances of a certain
restricted type. It differs from a general factor like intelligence,
which is found to run through an extremely wide range of tests, which
indeed enters to some extent into all abilities.

Since the consistency or overlapping of different mechanical tests
is not perfect, no one test can give a really adequate measurement of
mechanical ability. But by combining several tests we are likely to
get a result much more representative of the group factor as a whole.
The same applies, of course, in the measurement of educational
abilities. We do not expect long-division sums alone to tell us how
good pupils are at arithmetic, although they provide some indication
of the ability. Instead we set several types of sums, and try to cover
the whole field of arithmetic which the pupils are supposed to know.


Arithmetic is likely to yield a group factor similar to the mechanical
ability factor. But many of the faculties assumed by psychologists
in the past, or by laymen at the present time, fail to stand up to the
criteria enumerated above. Memory, for example, refers to so many
different things that it is meaningless to talk of so-and-so as having a
good or bad memory in general, or of ‘training children’s memories.’
The speeds with which people learn various sorts of material, and
their retentiveness (i.e. the amounts of such material which they can
recall after an interval), give only moderate or low inter-correlations,
especially when the influence of intelligence is removed. Probably
there exist several small group factors, each representing memory
for a certain narrow range of material. In other words, there may be
several different types, but no single entity, of memory.

The notion that mental speed differs from, and is usually opposed
to, mental accuracy or power also appears to be fallacious. If these
were two quite distinct abilities, then tests of speed would give high
correlations with one another, and low or negative correlations with
tests of accuracy or power. Under appropriate conditions of testing,
difficult untimed tests and easy speeded tests do yield somewhat
different results; thus a partial distinction is indicated. But they still
correlate positively to a moderate or a high degree, showing that, on
the whole, those who are ‘slow’ at mental work are more likely to be
‘unsure’ than ‘sure.’

All Mental Abilities are Positively Inter-correlated. Let us now turn
to the wider problem: what abilities does the mind contain, and how
are they organized or inter-related? A first, extremely important, fact
is that all tests of mental abilities tend to give positive
inter-correlations. Even tests of manual, physical, and other non-intellectual
functions also usually correlate positively with one another and with
mental tests, though the coefficients are often very small, whereas
the coefficients among tests of intellectual functions are generally
moderate to high. In selected groups, such as university students,
occasional negative correlations may arise; but these are not typical,
and they are seldom statistically significant. In other words, talent and
versatility, more often than not, go together; a person outstandingly


¹ A summary of the evidence on this and other types of ability, obtained
between 1940 and 1950, is given in the writer’s book, The Structure of Human
Abilities (1950).
good or bad in one field is usually above or below average,
respectively, in all other fields. This fact at once serves to cast a great
deal of doubt on the ‘compensation theory,’ i.e. the view that those
who are poor at one thing make up for it by being good at something
else. It does not entirely disprove it, for the following reasons.

First, the reader should study the scattergram for a very low
positive correlation, such as that shown in Fig. 26. This compares
the marks of students in an examination in hygiene with their
intelligence-test scores and yields an r of +0.146. It will be seen that,
though there is still a slight tendency for high scores on X1 to be
associated with high scores on X2, yet the correlation admits of many
big exceptions. Students who are among the highest 3% for
intelligence may range right down to the lowest 18% for hygiene. The
large amount of regression means that a fair proportion of persons
can be high on one variable, low on the other. Such a state of affairs
is certainly much more likely to occur here than it is when the
correlation is high (as in Fig. 23).

[Fig. 26. Scattergram and regression lines illustrating a low correlation,
r = +0.146. Axes: Hygiene Marks (X1) and Intelligence Test (X2).]


Now ‘book-learning’ and ‘being clever with the hands’ are not
highly correlated; hence, it is quite possible for some children to be
much better at one than at the other. But it is definitely false to
suppose that they are inversely related, that those who are bad at
one are therefore good at the other. On the contrary, those who are
superior in intellectual abilities are on the average superior in
practical ones also, though to a lesser extent.

A second reason is that age differences often obscure the issue.
Some teachers are incredulous when told that children of superior
intelligence tend to be superior also in growth and physical capacities
to dull children, since they know well that the dullards in an ordinary
school class are usually larger and stronger than the bright
youngsters. But, then, they are forgetting that the dullards are far older
than the bright ones, and so may be physically superior. If the dull
children were compared with bright ones of the same age, the
physical superiority of the latter would at once be obvious.
Nevertheless, in this instance also the correlations are so low that a great
many exceptions are possible.

Thirdly, ability is obviously to some extent a matter of interest
and practice. Pupils who are backward at intellectual studies and
who are on the average backward, but much less so, in physical and
manual pursuits, naturally devote more time and energy to such
pursuits. Bright pupils, though initially stronger and better at
practical matters, may be more interested in intellectual matters, and so
fall behind in other things. It is possible, therefore, for the differences
that exist between abilities which give low inter-correlations to
become accentuated, though there is still no evidence known to the
writer that the correlations ever become negative.

We may conclude, then, that all the abilities with which teachers
are concerned are, to a greater or lesser extent, positively correlated.
As an example, we will quote some of Burt's (1917) results. He gave
thirteen objective tests of achievement in various school subjects to
120 children aged 11+, and calculated the inter-correlations. The
following table shows the average correlation between each test and
the other twelve tests.¹


¹ The average reliability of the tests was +0.738. All these coefficients would,
therefore, be larger by roughly 35% if the tests had been perfectly reliable (cf.
p. 129). A later repetition of the investigation with a much larger number of
children has yielded closely similar results (Burt, 1939a).
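
The figure of roughly 35% follows from the usual correction for attenuation, in which an observed correlation is divided by the square root of the product of the two tests' reliabilities. The lines below are a minimal Python sketch of that arithmetic, assuming that both tests in a pair have the average reliability of 0.738; the function name is illustrative, and applying the correction to the composition average of +0.505 is done here only for demonstration.

```python
import math

def correct_for_attenuation(r_observed, reliability_x, reliability_y):
    """Estimate the correlation that two perfectly reliable tests would show."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Average reliability reported for the thirteen tests.
AVERAGE_RELIABILITY = 0.738

# With equal reliabilities the divisor is simply 0.738, so every
# coefficient is multiplied by the same factor.
inflation = correct_for_attenuation(1.0, AVERAGE_RELIABILITY, AVERAGE_RELIABILITY)

# Illustration only: the average correlation quoted for composition.
corrected_composition = correct_for_attenuation(
    0.505, AVERAGE_RELIABILITY, AVERAGE_RELIABILITY
)

print(f"inflation factor: {inflation:.3f}")
print(f"corrected composition average: {corrected_composition:.3f}")
```

The inflation factor is the same for every coefficient only because a single average reliability is assumed; with test-by-test reliabilities each pair would have its own divisor.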
Composition, history, geography, nature study, and arithmetic
problems overlap with the other tests to the largest extent.
Mechanical arithmetic, drawing, and writing quality correlate to the smallest
extent. This shows that there is a tendency for a pupil's achievements
in all subjects to run on a fairly even level, though the correlations
are low enough for some pupils to exhibit considerable divergences:
special strengths in some subjects, special weaknesses in others.
During the secondary school stage, correlations between subjects
decrease, partly perhaps because abilities tend to differentiate during
adolescence and specialised talents develop, but mainly because the
pupils form more homogeneous groups (cf. pp. 122-3). Nevertheless,
the positive overlap of all intellectual abilities continues even beyond
the university level. For example, the present writer inter-correlated
the marks of 300 graduate students at a training college, and obtained
small positive coefficients between subjects as widely different as
teaching skill, arithmetic, hygiene, education, psychology, speech
training, singing, physical training, etc. (cf. Vernon, 1939).

Test                        Average correlation
                            with the other twelve tests

Composition                 +0.505
History                     +0.448
Geography                   +0.456
Nature Study                +0.458
Mechanical Arithmetic       +0.283
Arithmetic Problems         [illegible]
Reading Fluency             +0.323
Reading Comprehension       +0.369
Writing Speed               +0.323
Writing Quality             +0.274
Dictation                   +0.325
Drawing                     +0.275
Handwork                    [illegible]


These facts are of considerable importance to examiners. They
indicate that a General Certificate in several subjects, achieved at the
age of 15 to 18, is some criterion of aptitude for any intellectual
occupation, or for a university career, albeit not a very good one.
Similarly, the traditional method by which the Civil Service selected
people for a wide variety of posts was a general examination based
largely on subjects studied at school or university. This implied that
the possession of ‘brains’ for one subject indicates aptitude for all
other subjects. Though this method was a fairly sound one, the
Civil Service has now come to realize that the efficiency of its
selection can be improved by adding tests of the aptitudes needed
for each type of post.
The General Factor Responsible for Positive Overlap. Now when
we find positive association, as in Burt's investigation, we are justified
in assuming the existence of a general factor, general educational
ability, which runs through all the separate subjects. Some subjects
may be said to be more ‘saturated’ with it than others, since they
are more highly correlated with it. In the primary school we should
get an excellent indication of this general factor from pupils’ marks
in composition and in arithmetic problems, a much poorer indication
of it from drawing or handwriting quality. Statistical methods are
available for assessing a pupil's standing on this general factor from
his marks on all the component subjects (cf. p. 124). Further, it is
quite easy to partial out the factor, and so to see whether or not it
accounts for all the correlations between the separate subjects,
whether, in other words, it is the only common element involved.
When this is done, it is found that there are still a number of
correlations over and above those due to the general factor. In Burt's
study, the residual correlations indicated considerable overlapping
within the following four groups of tests: (1) arithmetic problems
and mechanical arithmetic; (2) handwork, drawing, writing speed
and quality; (3) dictation, reading comprehension and speed;
(4) composition, history, geography, and nature study. But there
was little or no overlap, or even negative correlations, between the
tests in one of these groups and those in another group. Such results
show clearly the existence of group factors, i.e. of distinctive types of
educational abilities, over and above general ability. In this instance
there is an arithmetical factor, a manual one, a linguistic one, and
one including all the higher, more integrative, school subjects. As
Burt points out, we have here a basis for cross-classification of
pupils. He suggests that they should be assigned to school classes in
accordance with their achievements in the linguistic and ‘integrative’
subjects, and that they might advantageously be reclassified for
arithmetic lessons, and again for manual subjects.
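
Partialling out a factor can be illustrated with the standard first-order partial correlation formula. The Python sketch below is an illustration only: the observed correlation and the two general-factor saturations are hypothetical values invented for the example, not figures from Burt's study, and the function name is likewise an assumption.

```python
import math

def partial_correlation(r_12, r_1g, r_2g):
    """Correlation between tests 1 and 2 with the factor g held constant."""
    return (r_12 - r_1g * r_2g) / math.sqrt((1 - r_1g ** 2) * (1 - r_2g ** 2))

# Hypothetical figures for two tests of the same kind.
r_tests = 0.60          # observed correlation between the two tests
g_saturation_1 = 0.70   # correlation of test 1 with the general factor
g_saturation_2 = 0.50   # correlation of test 2 with the general factor

# A residual well above zero points to a further common element,
# i.e. a group factor shared by the two tests.
residual = partial_correlation(r_tests, g_saturation_1, g_saturation_2)
print(f"residual correlation: {residual:.3f}")

# If the general factor were the only common element, the observed
# correlation would equal the product of the two saturations and
# nothing would be left over.
only_general = partial_correlation(
    g_saturation_1 * g_saturation_2, g_saturation_1, g_saturation_2
)
print(f"residual when g explains everything: {only_general:.3f}")
```

In practice the residuals must be judged against their sampling errors before a group factor is inferred from them.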

Analogous findings were obtained in the writer's study of training
college marks. Three separate ‘families’ of subjects could be
distinguished, a practical one (teaching skill, speech training, and
physical training), a scientific one (psychology, hygiene, arithmetic),
and a literary one (English, speech training, [illegible] history). Few
researches along these lines have been carried out in the industrial
sphere, owing to the difficulties of measuring and correlating
different vocational abilities. But we might expect to get from them
similar results, for example, group factors for the engineering type
of occupation, for the salesmanship type, the clerical type, and so on.


FACTORIAL ANALYSIS

The statistical investigations of Spearman (1927) and others have
shown that it is possible to account for practically the whole¹ of a
set of test inter-correlations by postulating appropriate common
factors. It is legitimate, then, to regard each test as made up of two
components, or as measuring two things. In so far as it correlates
with other tests, it is measuring some factor or factors. This is often
referred to as the test's communality. But in so far as it fails to
correlate (and most test inter-correlations, be it remembered, are
only moderately high), it is measuring a purely specific component,
or something which is peculiar to that test alone and has no
relationship to any other test. This is called the test's specificity. It includes,
of course, the error which is present in the test on account of
imperfect reliability² (cf. p. 129). Thus any test of an educational or
vocational ability can be analysed into certain proportions of certain
factors. For example, in Burt's research, the arithmetic problems
test can be analysed into so much general educational ability, so
much arithmetical group factor (these together make up its
communality), and, thirdly, a component specific to that particular test.
The mechanical arithmetic test can be analysed into a rather smaller
proportion of general factor, a large proportion of the arithmetic
group factor, and a separate specific component. The presence of
these specific components reduces the correlation between the two
tests, but it is still quite high, namely +0.76, because they have both
a general and a group factor in common. A test of another subject,
such as handwork, still correlates positively with arithmetic
problems, because they are both partially saturated with the general
factor, but the coefficient is a small one, namely +0.08, since
handwork embodies a different (manual) group factor, as well as a
different specific component.
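
The way in which shared factors generate correlations can be sketched numerically. The Python lines below assume standardized tests and uncorrelated factors, so that the correlation between two tests is the sum of the products of their loadings on the factors they share, and a test's communality is the sum of its squared loadings. All the loadings are hypothetical values invented for illustration; they are not Burt's figures and do not reproduce his coefficients.

```python
def communality(loadings):
    """Share of a test's variance due to common factors (sum of squared loadings)."""
    return sum(value ** 2 for value in loadings.values())

def implied_correlation(test_a, test_b):
    """Correlation implied by the common factors that two tests share."""
    shared = set(test_a) & set(test_b)
    return sum(test_a[name] * test_b[name] for name in shared)

# Hypothetical loadings, invented for illustration only.
arithmetic_problems = {"general": 0.8, "arithmetic": 0.4}
mechanical_arithmetic = {"general": 0.6, "arithmetic": 0.6}
handwork = {"general": 0.3, "manual": 0.6}

for name, test in [("arithmetic problems", arithmetic_problems),
                   ("mechanical arithmetic", mechanical_arithmetic),
                   ("handwork", handwork)]:
    h2 = communality(test)
    # Whatever is not communality is specificity (including error).
    print(f"{name}: communality {h2:.2f}, specificity {1 - h2:.2f}")

# Two tests sharing both the general and a group factor correlate highly;
# tests sharing only the general factor correlate much less.
same_group = implied_correlation(arithmetic_problems, mechanical_arithmetic)
other_group = implied_correlation(arithmetic_problems, handwork)
print(f"two arithmetic tests: {same_group:.2f}")
print(f"arithmetic problems with handwork: {other_group:.2f}")
```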
¹ Note that we do not claim to be able to account for the whole of the
correlations, for such correlations are, of course, always liable to errors of sampling.
Often the residual coefficients, which are left after a general factor and some
group factors have been extracted, are so small relative to their P.E.s, that it is
not worth while analysing them further. Even with the most accurate techniques,
such as Hotelling’s and Burt's, the later factors are generally so small that little
significance can be attached to them.
² Thomson (1939) points out, however, that some of the influences producing
unreliability, e.g. those of the function fluctuation type, may actually be
common to several tests, and so contribute to their communality.
By means of factor analysis, then, it is possible to reduce a large
number of test results to a few underlying common factors. Any
psychological test, or measure of a scholastic or occupational
ability, can be regarded as made up of some of these factors, or as
possessing a certain ‘factor pattern.’ Further, an individual person’s
standing on each factor can be derived from his test scores, and he,
too, possesses a certain factor pattern. Thus, a pupil who is high in
general educational factor will be above average in most of his school
work, but his relative goodness at different subjects will depend on
the strength or weakness of his group and specific factors.

Dangers of Factor Analysis. At this point a strong warning should
be given against carrying the factorial conception of abilities too far.
Factors are not entities in the mind whose nature or constitution,
and whose strength or weakness, are immutably fixed. Some writers
do seem to assume that factorial analysis is revealing the
fundamental elements of which human minds are compounded, much as
chemical analysis reveals the elements from which chemical
substances are compounded. We should realize that factors consist
primarily of categories for classifying mental tests and examinations.
Their value lies in the fact that they tell us objectively what correlates
or overlaps with what, and so they take the place of unverified
faculties and other subjective conceptions of human traits and
abilities.

It is obviously illegitimate to compare factors in the educational
field with chemical elements. For these factors, which are deduced
from the correlations between various school subjects, must depend
largely on how the subjects are taught and on the stage which the
pupils have reached. Different factor patterns will, therefore, be
found in different school classes, according as the teachers connect
up, or bring about overlap in their pupils’ minds between, the
subjects. For example, Oldham (1937-8) found very diverse
relationships between arithmetic, algebra, and geometry among different
12- to 13-year-old school classes, which could be ascribed to
differences in teaching methods and in the pupils’ attitudes towards the
subjects. Others have proved that the factors extracted from a set of
mental tests alter progressively with practice at the tests.

Like all other statistical constructs, factors will be liable to a good
deal of variation if they are derived from measures of small numbers
of persons. But apart from such sampling errors, they must be
regarded as depending to a considerable extent on the particular
group of persons measured. Thomson (1939) shows that they are
markedly affected by the heterogeneity or homogeneity of this
group of persons. In addition, they are dependent upon the particular
set of tests used, since the factor pattern of a test is in essence an
expression of the relations between this test and all the other tests in
the set. Statistical analysis of a single test in isolation is incapable of
telling us what factors it contains (though subjective analysis, or
inspection of its content, may often give useful hints as to its
probable factor pattern, which can later be verified by studying
objectively its correlations with other tests). However, this limitation
may not be very serious, since, as we shall see below, at least some of
the most important factors emerge in almost identical form from a
wide variety of sets of tests.
Then there is the more subtle difficulty that the same set of tests
can always be factorized in a great many different ways. Most of
these ways will lead to quite illogical results, so that the investigator
has to choose from among the various possible factorizations the
most meaningful and convenient pattern. Take, for example, the
educational tests applied by Burt. We have so far regarded these as
analysable into general educational ability, four group factors for
related subjects, and separate specific factors for each test. Now Burt
showed, incidentally, that the general factor was not identical with
general intelligence as estimated by teachers, or as measured by
intelligence tests, though the two overlapped closely. But it would
have been quite legitimate to analyse measures of intelligence along
with the educational measures, and to partial out intelligence first,
so making it the principal factor underlying achievement. If this
had been done, it is likely that a subsidiary general factor would
have emerged representing the pupils’ application to, and interest
in, their work in general. Intelligence plus this subsidiary factor
would then make up what we have previously denoted as the
general educational ability factor. The group factors would also need
to be somewhat modified, since most of the fourth one (composition,
history, etc.) might have been absorbed into, or accounted for by,
the intelligence factor.¹
In investigating students’ college marks, the writer obtained
analogous results. The inter-correlations could be accounted for
either by a very prominent general factor and group factors
restricted to a few subjects, or else by three factors, all of about equal
prominence, each entering into several subjects.

Certain other precautions which should be observed in
interpreting the results of factorial analyses will be mentioned later.

¹ In his later study (1939a), Burt describes several methods of factorizing
which do yield numerically different, though logically consistent, results.
Types of Factor Pattern. Three main types of factor pattern
recur very frequently in educational and psychological investigations.
They may be represented by the following diagrams, each of which
refers to twelve tests, Nos. 1, 2, ...


[Diagrams of the three factor patterns. 1. Bi-factor pattern: each test
shown against a general factor, group factors, and a specific. The
remaining diagrams show each test against common factors and a specific,
and against a general factor and a specific.]

In the first, which Holzinger and Harman (1941) call the
bi-factor pattern, there is a general factor running through all the tests,
four different group factors, and specifics. This pattern (or similar
more complex ones) is usually adequate for representing the results
of tests of abilities. For instance, it applies excellently to Burt's
investigation.

The second, multiple-factor, pattern requires three, four, or more
factors, none of them general, but all running through several tests,
to account for the test inter-correlations. It will be seen that Test 1
involves factors A and C and a specific. But as far as possible each
test is covered by a single factor (with only negligible loadings on the
others). This type of pattern is referred to as Simple Structure (cf.
Thomson, 1939; Thurstone, 1947). It is often used in analysing tests
of abilities, but is still more appropriate where tests of personality,
character, or temperament are concerned, since in these fields a
general trait, running through all the tests, would seldom be
appropriate. For example, Thurstone (1931) obtained measures of the
interests of a group of students in a large number of different
occupations (advertising, art, chemistry, etc.) and found that the interest
scores could be classified under four main factors. These he roughly
identified as interest in language, in science, in business, and in
people. Each separate interest could be made up of appropriate
proportions of these four factors. Other instances have been
surveyed by Eysenck (1953).

Spearman's Two-factor Theory. The third, unitary-factor, pattern
has aroused a great deal of controversy. Historically, it was the first
type to be discovered. Early in the present century, Spearman found
that diverse tests of mental abilities usually gave inter-correlations
which could be wholly accounted for (within the limits of their
errors of sampling) by a single general factor plus specific factors.
He enunciated certain criteria to which the correlations must
conform if this is to be possible, the most important of which is called
the tetrad difference criterion.¹ The general factor obviously
corresponds to what we commonly mean by general intelligence, but
Spearman preferred to symbolize it by the letter g, the specific factors
by the letters s1, s2, ..., etc., so as to get over the difficulty, mentioned
above, that the term ‘intelligence’ has been so diversely defined by
different psychologists. His g factor, it was claimed, would be the
same whatever the tests used. It could be measured, for example, by
a battery of tests consisting of Analogies, Completion, Vocabulary,
and Directions, or equally well by a battery consisting of practical
performance tests (pp. 160-2). Each test would have its own independent
s, but in so far as it overlapped with other tests, it would be
measuring g. This view is widely known as the Two-factor Theory.
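
The tetrad difference criterion can be illustrated numerically. If the correlations among four tests arise from a single general factor, each correlation is the product of the two tests' g-saturations, and every tetrad difference (a product of two correlations minus another such product among the same four tests) comes out at zero, apart from sampling error. The Python sketch below uses hypothetical saturations chosen only for the example; the group-factor increment of 0.15 is likewise invented.

```python
from itertools import combinations

def tetrad_differences(corr, a, b, c, d):
    """Return the three tetrad differences for tests a, b, c, d.

    corr maps a frozenset of two test names to their correlation.
    """
    def r(x, y):
        return corr[frozenset((x, y))]

    return (r(a, b) * r(c, d) - r(a, c) * r(b, d),
            r(a, b) * r(c, d) - r(a, d) * r(b, c),
            r(a, c) * r(b, d) - r(a, d) * r(b, c))

# Hypothetical g-saturations for four tests.
g = {"analogies": 0.8, "completion": 0.7, "vocabulary": 0.6, "directions": 0.5}

# Under a single general factor, each correlation is a product of saturations.
corr = {frozenset((x, y)): g[x] * g[y] for x, y in combinations(g, 2)}
pure = tetrad_differences(corr, *g)
print([round(t, 6) for t in pure])

# Add a hypothetical group factor shared by two of the tests only:
# their correlation rises and some tetrad differences no longer vanish.
corr_group = dict(corr)
corr_group[frozenset(("analogies", "completion"))] += 0.15
disturbed = tetrad_differences(corr_group, *g)
print([round(t, 6) for t in disturbed])
```

With real data the differences would be compared with their probable errors rather than with exact zero.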

For a while it was thought that the Two-factor Theory would
apply to all tests of abilities, not only to intellectual ones, and that
very few additional group factors would be needed. Only
occasionally would a few tests which were very similar in content (e.g. tests
of musical aptitude) be found to inter-correlate over and above their
correlation due to g.

Different tests may differ considerably in their g-saturations.
Those with the highest saturation seem to involve the capacity for
seeing relationships, whilst tests which involve mainly mechanical or
rote memory contain very little g. Among school subjects, classics is
most dependent on g; French, English, history, and the like are also
highly saturated. Manual subjects and music have much more
prominent s’s and relatively little g. Thus we find high correlations
between classics and English, low correlations between handwork
and music, and moderate correlations between classics and
handwork, or English and music. The Two-factor Theory not only
explains these and similar facts as to the relations between abilities,
but also provides the mental tester with a scientific basis for choosing
tests which will give the best measures of g. Although every sub-test
in a battery of intelligence tests will involve a certain s-factor, tests
can be selected in which these s’s are small, and when the tests are
combined the s's will tend to cancel out, so leaving an almost pure
measure of g.

¹ For an explanation, see Knight (1933), or Spearman’s own book (1927).
The apparent discrepancy between the names, Two-factor Theory and
Unitary-factor Pattern, is merely due to the fact that Spearman includes the specific
components as factors, whereas the terms we have used refer only to the
communality of the factorized tests.
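
The cancelling out of the specific factors can be shown with a short calculation. The Python sketch below assumes the two-factor model holds exactly, so that standardized sub-tests correlate with one another only through g, and it computes the correlation between g and the unweighted sum of the sub-tests. The saturation of 0.6 is a hypothetical value and the function name is illustrative.

```python
import math

def composite_g_correlation(saturations):
    """Correlation between g and the unweighted sum of standardized tests,
    assuming the tests correlate only through g (the two-factor model)."""
    covariance_with_g = sum(saturations)
    # Variance of the sum: a unit variance for each test, plus the
    # correlation g_i * g_j for every ordered pair of different tests.
    variance = sum(
        1.0 if i == j else gi * gj
        for i, gi in enumerate(saturations)
        for j, gj in enumerate(saturations)
    )
    return covariance_with_g / math.sqrt(variance)

SATURATION = 0.6  # hypothetical g-saturation of every sub-test

for n_tests in (1, 4, 16):
    value = composite_g_correlation([SATURATION] * n_tests)
    print(f"{n_tests:2d} sub-tests: correlation with g = {value:.3f}")
```

The composite approaches, but never quite reaches, a pure measure of g as sub-tests are added, provided their specifics really are independent.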
Criticisms of the Two-factor Theory. The theory has been
subjected to much criticism, both on psychological and on statistical
grounds. Spearman suggested that g might correspond to the
conception of ‘general mental energy.’ Whether or not this is a
psychologically sound explanation need not concern us here. On the
statistical side, Thomson (1939) and others pointed out that,
although the analysis into general and specific factors is legitimate
when the tetrad difference criterion is satisfied, yet this is not the
only possible mode of analysis; that if mental abilities consisted of
large numbers of overlapping group factors, the same criterion
would still be satisfied. Thomson's view would seem to the present
writer to be preferable from the psychological standpoint, but
Spearman's view is perhaps more practically convenient, since we do
already in everyday life assume the existence of something like g,
whether we call it ‘brains,’ ‘brightness,’ ‘intelligence,’ or what not,
and the Two-factor Theory provides a scientific basis for this
common-sense assumption.

The theory, and the experiments which it has stimulated, at one
time seemed to discredit all the mental faculties which had so often
been abused in traditional psychology and in common parlance. But
more recent research with modern factorial methods indicates that
the theory requires considerable modifications in the matter of
group factors. A large variety of group, or multiple, factors has been
described, particularly in American researches, and some of them do
resemble the older faculties. Unfortunately, there are considerable
disagreements between different factorists as to which are the most
distinctive and fundamental; but the series that Thurstone established
is very widely accepted. Thurstone (1938) applied an extensive
battery of psychological tests to university students and found that they
resolved into factors which he identified as verbal ability (V),
number ability (N), word fluency (W), spatial judgment (S), rote
memory (M), perceptual speed (P), and deductive and inductive
reasoning (D, I). A fairly similar picture was obtained in studies of
children, even down to the 5- to 6-year level, and Thurstone has issued
sets of tests for measuring these factors at several age levels, namely
the P.M.A. or Primary Mental Abilities tests. Note that he does not
allow for a g running through all these abilities, though Spearman
and his followers have shown that a pattern of general + group
factors fits the same results as well as, or better than, a pattern of
independent multiple factors.¹
It may safely be concluded that no test measures nothing but g and
a specific factor, since the type of test material employed always
introduces some additional common element. All verbal intelligence
tests, together with other tests depending on manipulations of words,
involve a verbal and educational factor, which has been named V or
v : ed. Those based on multiple-choice items, to be done at speed,
embody a further factor or factors, not present in unspeeded,
creative-response tests such as Terman-Merrill. In order to escape
the influence of v : ed, many mental tests have been constructed from
abstract diagrams or pictures. Such tests, however, in so far as they
involve spatial imagination or the mental manipulation of shapes,
yield a spatial factor called k (El Koussy, 1935), or S (Thurstone, 1938).
Certainly some, and possibly all, of the common non-verbal
performance tests bring in this same k, and various minor dexterity
factors in addition.² Mechanical ability tests too measure k, together
with a mechanical information or experience factor, m. School
examinations and educational attainments tests depend (as Burt's
research showed) not only on g and v : ed, but also on a factor which
represents the pupils’ interest in and attitudes to work, and their
character or temperament qualities. Following Alexander (1935),
this is often referred to as the X-factor. In addition, there are probably
group factors for families of school subjects such as the linguistic,
scientific and mathematical, practical and manual, and further
sub-factors largely governed, as we have seen, by teaching methods.
Educational and Vocational Applications. The findings of
factorial research have important educational and vocational
bearings. If the Two-factor Theory was exclusively accepted,
educational and vocational guidance by means of tests would scarcely be
possible. If we wished to advise a child as to the type of education
¹ A more detailed discussion of the pros and cons of the g + group-factor and
the multiple-factor approaches is given elsewhere (Vernon, 1950), and it is shown
that they can fairly readily be reconciled.
² Alexander (1935) claimed the existence of a distinctive practical factor, F,
in certain performance tests; but later work appears to indicate that it cannot be
differentiated from k.
or the occupation for which he would be most suited, we could
apply tests of g and so deduce the general intellectual level of the
work he should take up. But we could not decide whether he would
be likely to do better at classics or science, at a clerical job or an
engineering trade, and so on, except by giving a vast number of
tests of different s factors, or by letting him try each type of work in
turn so as to find by practical experience which specific one suited
him. There could be no recommending of general lines of occupation.
To a certain extent this does seem to be true. Educational and
vocational guidance among adolescents are extremely difficult and
uncertain. They have to be based more upon a subjective analysis of
interests and character traits, background and relevant training, than
upon objective predictions by tests. Yet common experience strongly
suggests that there do exist general categories or types of work, and
that therefore a number of group factors, additional to g, could be
established by further research. If this were done, guidance would
become much more scientific. Full consideration of a candidate's
interests and character would, of course, still be required, but his
scores on a series of tests for each of these factors would give
objective indications of his probable ability along each of the main
lines of work. Much the same is true in clinical work with
maladjusted children or neurotic or psychotic adults. Psychiatrists and
clinical psychologists often claim, on the basis of poorly validated
tests, that their patients show deterioration or weakness in such
faculties as memory, orientation, attention, visualization, etc. Factor
analysis indicates that most of these tests are merely rather
unreliable measures of g, or g + v : ed. But at the same time it can be


. >; ; 

1 Thomson (1939) has pointed out that when tests are given with the object 
of predicting educational or vocational achievement, the extraction of factors 
from the tests, may constitute quite an unnecessary intermefiate step. A much 
more efficient method is to correlate the tests directly with measures of achieve- 
ment and then to use*multiple correlation for obtaining the most accurate pre- 
dictions. While thts view. is, ofcourse, perfectly true from the statistical stand- 
point, it would seem to the prefent writer to pe practically applichble only, in 
educatipnal and vocatfonal selection, not in guidance. It should certainly фе 
used by the tester who wishes to select the best catfdidates for some singlecana 
definite educatfonal course br job. But the problems, of the yocational adviser 
are much too complex {o be tackled as yet by this exact method. He is forced to 


predict a candidate's probable achievement in generalized terms, or.factors. At * 


the present time he is doing so Jargely in terms of unverified factors, derived 
from subjective considerations of testsand lines of work. The view put forward 
above is that his procedure eould be made more systematic'and scientific if more 
factors were experimentally established. EIS RODA UE 
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used to isolate definite group factors such as теша] speed, and 
flexibility in the formation of concepts, which are especially apt to 
bz affected by abnormal mental states; and it can show which tests 
ase the most promising for measuring such factors. | 
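
Thomson's direct method, mentioned in the footnote above, may be
illustrated by a small worked sketch. It assumes only two tests and
one criterion of achievement, applies the standard formula for the
multiple correlation of a criterion with two predictors, and works on
invented scores. The function names and the data are our own
assumptions and are not taken from any published procedure.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def multiple_r_two_tests(test_1, test_2, criterion):
    """Return (r1, r2, R): each test's correlation with the criterion,
    and the multiple correlation of the criterion with both tests."""
    r1 = pearson(test_1, criterion)
    r2 = pearson(test_2, criterion)
    r12 = pearson(test_1, test_2)
    # Standard two-predictor formula for the squared multiple correlation.
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return r1, r2, sqrt(r_squared)


# Invented scores of eight candidates (illustrative only).
TEST_1 = [52, 61, 47, 70, 58, 43, 66, 55]
TEST_2 = [40, 45, 38, 55, 60, 35, 48, 52]
CRITERION = [55, 63, 48, 74, 66, 41, 68, 60]

r1, r2, R = multiple_r_two_tests(TEST_1, TEST_2, CRITERION)
print(f"test 1: {r1:.2f}  test 2: {r2:.2f}  both combined: {R:.2f}")
```

The combined correlation can never fall below that of the better
single test, which is why the method suits selection for one definite
course or job, where a single criterion is available.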
While we have a fairly clear conception of the main factors
underlying the abilities of school-children and of unselected adults,
the picture among older adolescents and adults of superior ability is
much more complex and obscure. A g-factor can still be extracted from
their mental test results, but it seems to play a decreasingly
important part in educational or vocational achievement, and
interests, work attitudes, and temperament traits become more
influential. An important series of researches by Guilford et al.
(1950-4) in America with high-grade groups such as air-force pilots
and university students suggests that numerous intellectual factors
can be distinguished in the fields of reasoning, planning, judgment,
originality, etc.; and that it would be more profitable to develop
tests for these than for a vaguely conceived general capacity of
intelligence. Which of them will emerge as consistent and clear-cut,
and as educationally or vocationally important, among other groups
(e.g. British grammar school pupils and students) is still a matter of
doubt.
Conclusions about Factor Analysis.—A great deal of unavoidably
complex and expensive research will be needed before we can hope to
arrive at a reasonably complete map, as it were, of the whole realm of
human abilities. The difficulties are enhanced because, as we have
seen, the factor patterns which an investigator extracts are often
dependent on the particular tests and particular testees he employs,
and because there is much room for subjective judgment in the type of
factor pattern he chooses. Research at present is rather badly
co-ordinated, since each investigator sets out to explore fresh
territory instead of trying to build on the factors already fairly
well established by previous workers. Further, there is always the
danger of mathematical rather than psychological considerations
getting the upper hand. As already mentioned, factorial investigations
are not revealing the elements out of which the mind is compounded,
but are classifying test performances with a view to predicting more
effectively other performances, e.g. in the educational or vocational
sphere. But by no means all of the important facets of human nature
are susceptible yet of accurate measurement. There may be abilities,
such as executive capacity, intuitive power, goodness of teaching, and
social intelligence, which are generally recognized in daily life, but
which are not yet properly substantiated or analysed owing to the
inadequacies of our tests. Still more is this true in the realm of
emotional traits and interests, for reasons which the writer has
described elsewhere (Vernon, 1953). Yet we cannot hope to make
successful predictions about an individual child or adult unless we
know every side of his make-up. It may well be that all these attempts
at analysing the measurable abilities and traits of human beings will
never give us a picture of an individual personality considered as a
whole, that we shall always need to supplement test results with
subjective judgment and intuition in order to resynthesize them into a
living entity. But this, too, is a matter of psychological
controversy which the reader may follow up elsewhere.

We have no space to discuss the actual statistical techniques of
factor analysis. Spearman has fully described his method in The
Abilities of Man (1927). It is not difficult to apply, though rather
laborious. It is the most accurate method of extracting unitary-factor
patterns, but it is hardly suitable for bi- and multiple-factor
patterns. Holzinger's (1941) method is an extension which deals
readily with bi-factor patterns, though it does not seem to possess
much advantage over the flexible group-factor techniques outlined by
Burt (1950). The most widely used technique is Thurstone's centroid
method, which is well described by Guilford (1936), and is not
difficult to acquire. But it has many more elaborate refinements, for
which Thurstone's book (1947), or Cattell (1952), should be consulted.
Other more accurate techniques, whose aims and guiding principles are
somewhat different from those described in the present chapter, have
been developed by Hotelling, Kelley, and Burt. The clearest account of
the whole subject is that of Thomson (1939).
MENTAL TESTS

MENTAL testing is simply a more refined and scientific method of
doing something which we all of us do every day of our lives, that is
assessing one another's abilities and character traits. Our ordinary
method is to observe a person's behaviour, including his facial
expressions and gestures, in various situations and circumstances, and
his speech or written productions. For example, we might watch a small
boy playing with bricks, and from his skilfulness, or the
elaborateness of the resulting construction, jump to the conclusion
that he is quite a bright child. Obviously the basis for this and other
such judgments is very haphazard and unreliable. There is no sure
evidence that the boy is acting more, or less, intelligently than the
majority of children. Nevertheless this performance may quite well be
refined and turned into a scientific mental test. So-called
performance tests of intelligence are based on just such activities
with bricks, blocks, and other concrete material.

Let us then first consider the essential features of a mental test 
which distinguish it from such everyday judgments and from the 
slightly more scientific school examination. 

1. Standardization of Test Situation.—The first step which the
tester takes is to standardize the test situation. He will, for
example, adopt some standard set of bricks and give definite
instructions, so that all children who are subjected to the test shall
do it under as nearly as possible identical circumstances. Until this
is done it is not legitimate to compare the performances of different
children. With every good test there is published a detailed set of
instructions, to which the tester must closely adhere. School and
university examiners, on the other hand, usually give no instructions
beyond stating the number of questions to be answered, and they
neglect to inform the examinees what type of answers they require.
Because of this, and because of frequent ambiguities in the wording of
questions, different examinees may conceive the tasks that are set
them in a variety of different ways, and it becomes exceedingly
difficult to compare and mark their answers.

The material of a test, its conditions of application, and its
instructions are further planned to be readily comprehensible to, and
suitable to the mental level of, the children or adults for whom it is
intended. This is another feature which some school examiners might
be advised to copy instead of setting subjects for compositions, or
other questions, which are far above the heads of the pupils.

We must, however, admit that several widely used tests do not
entirely live up to these criteria. Some of them were originally
constructed and standardized in America; hence they require
considerable modification or translation before they are suitable for
British children. We are forced to employ these 'second-hand' tests
because the production of fresh tests is an extremely lengthy and
expensive business. Unfortunately different British testers do the
necessary re-modelling in different ways, so that the test situation
is no longer identical for all.

2. Standardization of Scoring.—The mental tester does not observe
the child's behaviour, or his finished product, in the casual manner
implied by our illustration, but obtains an accurate record of them.
He insists on knowing definitely what was the test response, i.e. the
child's reaction to the previously laid down test situation. Either
the response is something which can be delimited in quantitative terms
(e.g. so many bricks put into their correct positions in so many
seconds), or else it is something which can be designated
unequivocally as right or wrong.¹ It is most important that the
personal opinion of the tester should play no part in this record. Any
two testers observing and scoring the response should arrive at
identical records. If this condition holds the test is called an
objective one. All good mental tests, then, are provided with scoring
keys or manuals of instructions which enable the recording and scoring
to be done objectively.

Our everyday judgments of traits or abilities are, by contrast,
decidedly subjective, for numerous investigations have demonstrated
their liability to vary with the person who makes them. Some teachers
are still convinced that they can assess the intelligence of their
pupils better than can any test. But the fact is that different
teachers, assessing the same pupils, give such different opinions that
the scientific psychologist can put little trust in any of them. The
same is true of many school and other examination marks, as will
appear in Chap. XI.

1 So-called Quality Scales are tests whose scoring constitutes an exception to
this rule, cf. p. 154.

3. Norms.—The layman commonly seems to regard the observed
behaviour as having some absolute value or significance. There is no
doubt in his mind that it indicates high intelligence, or poor
sociability, or whatever the trait may be. The more cautious
psychologist realizes that such judgments must be based largely on
more or less vague recollections of the behaviour of other similar
children under similar circumstances, and so insists that all grading
of abilities is essentially a comparative or relative matter (cf.
Chaps. II and IV). Thus one of the most vital accompaniments of a
mental test is the norms, by reference to which any child's goodness
or poorness of performance may be assessed. Unfortunately it is also
the feature in which many published tests are defective, on account of
the difficulties already described (pp. 66-73).

In the nature of examining it is hardly possible to establish norms
for examinations, since they cannot be applied to large numbers of
pupils or students beforehand for this purpose, as is the standardized
test. We have seen, however, that the process of rationalizing marks,
which is essentially equivalent to the setting up of provisional
norms, is necessary if the present ambiguities in the significance of
school or examination marks are to be avoided.

4. Reliability and Validity.—A single piece of behaviour, casually
observed, as in our illustration, would probably be very low in
reliability. A discussion of the statistical methods of determining
reliability was given above (p. 126), and the reliability of several
of the more important tests is mentioned later in this chapter.

The most serious defect in the layman's everyday judgments is that
he seldom has any good proof that the observed behaviour validly
indicates the ability which he deduces from it. It may have arisen
from some quite different trait or ability, or else from an ability
which is almost wholly specific (in Spearman's sense), i.e. not
correlated with anything else. To be reliable a test need only measure
accurately the ability to perform that test, but to be valid it must
also correlate effectively with other measures of whatever it is
supposed to test. The mere inspection of the content of a test is
seldom sufficient for the determination of its validity, although it
is a necessary preliminary.
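
The distinction just drawn may be illustrated numerically. The
following is a minimal sketch, assuming that reliability is estimated
by correlating two applications of the same test and validity by
correlating the test with an independent criterion. The scores are
invented, and the names used are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented data: the same eight children tested twice, plus an
# independent measure of the ability the test is supposed to indicate.
FIRST_APPLICATION = [12, 15, 9, 20, 17, 11, 14, 18]
SECOND_APPLICATION = [13, 14, 10, 19, 18, 10, 15, 17]
CRITERION = [55, 48, 60, 62, 50, 58, 45, 65]

# Reliability: does the test agree with itself?
reliability = pearson(FIRST_APPLICATION, SECOND_APPLICATION)
# Validity: does the test agree with another measure of the ability?
validity = pearson(FIRST_APPLICATION, CRITERION)

print(f"reliability: {reliability:.2f}  validity: {validity:.2f}")
```

With these figures the test agrees closely with itself but hardly at
all with the criterion: it measures something consistently, yet not
the ability it was supposed to test.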

Such inspection of educational tests and examinations may reveal
their failure to cover the ground which the pupils should have
accomplished. Thus school and university examinations are often
defective because their questions are too few in number, or are badly
set. It will be obvious also to a tester that the content of some of
the published educational tests is unsatisfactory, either because they
are designed for pupils whose tuition differs widely from that current
among his testees, or because they are out-of-date. But apart from
such flaws, many tests involve group or specific factors which detract
from their effectiveness as measures of some general ability. For
example, a test of reading may be based on the capacity for
pronouncing difficult words. This may or may not indicate children's
capacities in other aspects of reading, such as their fluency, or
their comprehension of what they have read. Investigation by means of
correlations is essential for establishing these points. Again, the
technique of the test, i.e. the way in which the testee is required to
express his ability, always seems to introduce an irrelevant factor.
In Chap. XI we shall show that in an ordinary essay-type of
examination, the translation of the examinees' knowledge into
essay-form is one such factor, and in Chap. XII the examiner's
translation of his questions into new-type or objective form will be
recognized as another. Temperamental factors, health, fatigue, and the
like also influence a testee's performances at many tests, and so
reduce their validity as measures of ability.

Most educational and vocational tests are assumed not only to be
valid indicators of some present general ability, but also to be
prognostic of future achievement. Their predictions should then, of
course, be followed up and compared with subsequently obtained
measures of achievement. In the educational field there have been many
such investigations whose results usually show our tests and
examinations to possess disappointingly poor validity (cf. Chap. XI).
Vocational psychologists have also tried to correlate their predictive
tests with some measure of output, or with estimates of efficiency
obtained from supervisors or foremen. As shown elsewhere (Vernon and
Parry, 1949), however, it is not always easy to secure a satisfactory
criterion of success or failure in an occupation against which the
test results can be checked.

The validation of intelligence tests, or of tests of capacities such
as practical ability, verbal ability, and the like, is particularly
difficult, since we possess no objective criterion of intelligence,
etc., with which to compare them. Ratings have sometimes been used,
that is, teachers' estimates of the intelligence of children in their
charge, but these are highly fallible. If psychologists are unable to
agree as to what they mean by intelligence, it is unlikely that
laymen's views will be any more concordant. Indeed, if teachers could
judge it accurately, there would be no need for tests. The main defect
of such ratings is that they are strongly biased (often quite
unwittingly) by considerations such as the children's industriousness
and good school behaviour. Parents' judgments are still less
trustworthy. Several group intelligence tests have been validated by
comparing their results with the results of one of the Binet scales.
But the validity of these scales is open to doubt. The items which
they contain have always been chosen so as to differentiate older
(hence presumably more intelligent) children from younger (less
intelligent) ones. This criterion is, however, dubious in value, since
it would admit tests of, say, height and weight which have very little
validity as indicators of intelligence. Yet we do not rely merely on
subjective judgment when we state that these must be excluded from a
test of intelligence, for in the absence of any external criterion of
validity, we make use of what has been called the method of internal
consistency.

This method consists either of factorial analysis, or of a
modification thereof. All the test items or sub-tests are selected in
the first place on the basis of subjective opinion; they must appear
to the tester to involve the exercise of intellect. Thereafter they
are studied empirically. Either each item is compared with every
other, or with the sum of all the rest of the items, or sets of items
such as the sub-tests in a group intelligence test are compared with
other sets. The inter-correlations will serve to show whether or not
all the items are consistent with one another, that is, whether they
are measuring one and the same thing. And items that fail to conform
to this criterion are excluded from the final, published, version of
the test. The statistical conditions imposed by Spearman in his
factorial analysis of group intelligence tests are much more rigorous
than those used by Terman in revising the Binet scale. Both, however,
prove that there is a general factor running through all their test
items or sub-tests. Spearman and his followers wish to measure nothing
but this, that is, they aim at a uni-factor pattern, and so exclude a
number of sub-tests which fail to conform to such a pattern. The Binet
test, on the other hand, although it certainly embodies a g-factor,
contains a hotch-potch of heterogeneous tasks which would, if fully
analysed, yield several group factors (a bi-factor pattern), or even a
set of multiple factors (cf. Burt, 1939). The effects of these
different techniques of test construction will be discussed more fully
below. What matters for our present purpose is that when a test, or a
set of tests, is internally consistent, this signifies that it will
validly predict any other behaviour which involves the same general
factor. Intelligence tests work fairly effectively because scholastic
success, and many of the other intellectual activities of everyday
life, are known to be dependent largely on the g which they measure.
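
The item-selection step of the internal consistency method may be
sketched as follows. The sketch assumes items scored right (1) or
wrong (0), compares each item with the sum of all the rest, and drops
any item whose correlation falls below an arbitrary figure of .20. The
threshold, the data, and the function names are illustrative
assumptions, and the procedure is far cruder than the conditions
imposed by Spearman or by Terman.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses):
    """Correlate each item with the sum of all the other items."""
    n_items = len(responses[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        correlations.append(pearson(item, rest))
    return correlations


def consistent_items(responses, threshold=0.20):
    """Indices of items whose item-rest correlation reaches the threshold."""
    correlations = item_rest_correlations(responses)
    return [j for j, r in enumerate(correlations) if r >= threshold]


# Invented responses: eight testees (rows) by five items (columns).
# Items 0 to 3 rise and fall together; item 4 is unrelated to them.
RESPONSES = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]

correlations = item_rest_correlations(RESPONSES)
retained = consistent_items(RESPONSES)
print([round(r, 2) for r in correlations], retained)
```

Here the fifth item would be excluded from the published version,
since it does not vary with the rest, however plausible its content
may have looked to the tester.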
Test Construction.—Before proceeding, we would strongly urge
amateur testers, and teachers, not to attempt to make up their own
tests, apart from new-type ones for school examination purposes. To
devise test items similar to those which appear in published tests is
not difficult, but a great many technical features enter into the
selection of suitable items and into their standardization, in
addition to those described here. The points which we have outlined
are meant, not to show how to construct a test, but to enable amateurs
to choose more wisely the tests which will be most suitable for their
purposes, and to give them some assistance in understanding what they
are doing when they apply published tests.

TYPES OF TESTS

At the end of this chapter is given a classified list of most of the
mental tests available in this country. The main headings are:

I. Attainment, Achievement, or Educational Tests.
   A. Examinations.
   B. Standardized Educational Tests.
      1. Individual reading tests.
      2. Language development and vocabulary.
      3. Individual oral arithmetic.
      4. Group tests.
II. Individual Intelligence Tests.
   A. Versions of the Binet-Simon scale.
   B. Wechsler scales.
   C. Miscellaneous scales.
   D-E. Performance Tests.
III. Group Intelligence Tests.
   A. Oral.
   B. Non-verbal.
   C. Verbal.
IV. Aptitude and Special Ability Tests.
V. Personality Tests and Assessments.

The last two sections merely mention some of the tests commonly
applied in this country, but the first three aim to be comprehensive.
Although we shall not attempt to describe particular tests, the
following general explanations of, and comments on, the main types may
be useful to those who are not already thoroughly familiar with the
field.

Examinations and Educational Tests.—Examinations are still the
most important method of educational measurement, and so come at the
head of our list. Their defects, and the possibilities of improving
them, or of employing substitutes, are discussed in Chap. XI and the
chapters that follow. Standardized educational tests must be
contrasted with them, for although they are made up of the same kinds
of items or questions as occur in objective or new-type examinations,
their functions are quite different. They are not designed to cover
the work done by particular classes of pupils or students. Nor are
they meant, primarily, for selecting pupils who are most likely to do
well in some future work, though they are nowadays quite commonly used
for assessing attainments at the 11+ selection stage. Rather they
serve to show what is the standing of a class, or of an individual
pupil, relative to other similar pupils, either on some general
subject such as English or arithmetic, or on special branches of, or
stages within, a subject (these latter being known as diagnostic,¹ or
analytic, tests). The fact that they are published, and their expense,
obviously make them unsuitable for use as school examinations. By
means of them, however, a teacher can usefully grade his pupils at the
beginning of the school year, so as to find which ones will need most
coaching, or in which parts of a subject they are most advanced or
backward. If the norms are adequate (cf. Chap. IV), they will tell him
how his pupils compare with pupils in general of the same age. The
psychologist at a Child Guidance Clinic, or the vocational adviser,
finds them especially valuable for assessing a child's strong and weak
points.

1 Cf. the excellent discussion of diagnostic tests by Schonell (1950); also
Whilde (1955).

Quality Scales.—Although most educational tests consist of large
numbers of short items, both so as to cover the subject as thoroughly
as possible, and so as to admit of objective scoring (cf. Chap. XII),
some educational products which are too complex to be assessed in this
piecemeal fashion can nevertheless be measured by 'quality' or
'product' scales. Such standardized scales are available for grading
handwriting, composition, drawing, etc. Each of these consists of a
series of specimen products which have been carefully chosen as
representative of certain levels of achievement. The testee is told to
write or draw the same matter in his normal manner, and his product
is, as it were, slid along the scale, or compared with the specimens,
until one is found to which it corresponds in achievement. It is then
given the score of this specimen. Such grading cannot be made wholly
objective, but there is evidence that it improves with practice.
Different testers will, when experienced, award the same or nearly the
same score to a child's product. Certainly the process is fairer than
grading without any external standards. Hence the adoption of a
similar plan in the marking of school work was advocated in an earlier
chapter. Burt's scales all consist of specimens selected from a large
number as being typical of the work of average 5+, 6+, ... 14+
children. Schonell's composition scales, covering the 7- to 14-year
range, are similar. Cattell's handwriting scale for school-leavers and
his drawing scale for adults are merely graded I to V, representing
different degrees of merit at these age levels.

Quantitative Tests and Scales.—The scoring of quantitative tests
either by the 'rate' or 'power' systems, or by a mixture of these, has
already been described in Chap. II. It will be noted that scarcely any
tests are available much above the 14-year level, and that very few
deal with subjects outside the 3 R's. One obvious reason for this is
that secondary school curricula vary so widely, and that only
relatively small numbers study a foreign language or a science, etc.,
by comparable methods at comparable levels. Nevertheless it would not
be difficult to devise and standardize tests in a number of subjects
suitable for some fairly homogeneous groups of pupils or students,
such as those in modern or in grammar schools, in teacher training
colleges, or provincial universities. Certainly our tests lag far
behind those available for similar purposes in America.


INTELLIGENCE TESTS

Intelligence tests closely resemble quantitative educational tests
(to which they are, of course, historically prior) in their
construction and scoring. Thus, they may consist of:
(a) A series of tasks graded in difficulty (power plan).
(b) Tasks such as performance tests where the speed or number of
moves are recorded (rate plan).
(c) Sets of questions which are roughly graded, but where the score
depends on the number answered within a certain time limit.

Group tests generally include a 'battery' of some half-dozen
(between about four and twelve) 'sub-tests,' each containing some
twenty (between ten and fifty) questions or items. Alternatively they
are arranged in 'omnibus' form, where items of all kinds are mixed up.
In the battery type, each sub-test has its own instructions and sample
items, and is timed separately. In the omnibus form, the instructions
and samples may be printed along with the questions, or included in a
preliminary practice sheet, and one time limit suffices for the whole
test. An omnibus test is therefore somewhat easier to give, but a test
battery has the advantage of providing rest pauses every few minutes.

Almost all intelligence tests thus consist of rather large numbers
of items. Performance tests are exceptional, but it is desirable to
apply several, say six or more, of these. The tests are always applied
and scored in a standard manner, and the scores either take the form
of Mental Age years (as in the Binet and some other tests), or
percentiles or standard score I.Q.s (cf. p. 67).

The Innateness of Intelligence.—Many of the standard textbooks
describe intelligence as a general innate capacity underlying all our
abilities, dependent on the genes that we inherit, and therefore
fairly constant throughout life. Thus intelligence tests differ, it is
said, from attainments tests in that they consist of tasks whose
solution depends upon native ability rather than upon acquired
experience or upbringing. This view, however, seems rather
unrealistic. For although there is ample evidence of the existence of
inherited differences in intellectual potentiality, we are quite
unable to observe or measure such a capacity. The good or poor
intelligence which we observe in everyday life, at school or at work,
and which we try to measure by our tests, is the product of innate
factors and environment. As Hebb (1948) and Piaget (1950) have shown,
the young child gradually builds up or acquires his perceptions and
habits and his thinking through contacts with his physical and social
environment; and the level that he reaches at any age is affected
considerably by the kind and amount of intellectual stimulation that
the environment has provided up to that age. This level is fairly
stable under normal conditions of upbringing, more so than would be
gathered from the publications of some left-wing theorists and of some
extremely environmentalistic American investigators. But it does
fluctuate more than would be expected from the statements of some of
the early pioneers of mental testing. For example, over the primary
school age range from 6 to 11, or over the range from 11 to 18 or so,
the correlation between two similar tests is very far from perfect,
but it does not sink below +.70. According to the Probable Error of
estimate technique (p. 127), this means that the average or median
child alters by about 7 I.Q. points up or down on retesting. Many are
more stable than this, but a very few show much larger gains or losses
of 30 or more points (4 × PE); and 17% are liable to alter 15 or more
points either way.¹ At the same time the stability is great enough to
allow us to make useful educational or vocational predictions, for the
majority of children, over several years. Particularly striking
evidence of this fact is provided by Terman and Oden's (1947)
follow-up of children with very high I.Q.s over 25 years. A fuller
discussion of these controversial questions, and of the psychological
nature of intelligence, may be found elsewhere (Vernon, 1960).
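
The figures quoted above can be roughly reproduced by a short
calculation. The following is a minimal sketch under stated
assumptions: an I.Q. Standard Deviation of 15, a retest correlation of
.70, normally distributed errors, and the usual relation by which the
Probable Error is .6745 of the standard error of estimate. The exact
procedure of p. 127 is not reproduced here, and the function names are
our own.

```python
from math import erf, sqrt

PE_FACTOR = 0.6745  # half of a normal distribution lies within +/- 0.6745 SD


def standard_error_of_estimate(sd, r):
    """Spread of actual retest scores about the predicted score."""
    return sd * sqrt(1 - r ** 2)


def probable_error_of_estimate(sd, r):
    """Deviation exceeded by just half of the children."""
    return PE_FACTOR * standard_error_of_estimate(sd, r)


def proportion_altering_by(points, sd, r):
    """Two-tailed normal proportion deviating by `points` or more."""
    z = points / standard_error_of_estimate(sd, r)
    return 1 - erf(z / sqrt(2))


SD_IQ = 15.0    # assumed Standard Deviation of I.Q.
RETEST_R = 0.70  # lowest retest correlation quoted in the text

pe = probable_error_of_estimate(SD_IQ, RETEST_R)
share_15 = proportion_altering_by(15, SD_IQ, RETEST_R)
print(f"PE: {pe:.1f} points; 4 x PE: {4 * pe:.0f} points; "
      f"altering 15 or more: {100 * share_15:.0f}%")
```

On these assumptions the Probable Error comes to a little over 7
points, four times that to nearly 30, and the proportion altering by
15 or more points to about one child in six, which is broadly in line
with the figures given in the text.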
Thus a better way of looking at intelligence and attainments is to
say that the former refers to the more general qualities of thinking
(comprehension, level of concept development, reasoning and grasping
relations), qualities which seem to be acquired largely in the course
of normal development without specific tuition; whereas the latter
refer more to knowledge and skills which are directly trained, and
whose acquisition and retention depend more on the person's interests
and on his personality traits such as industriousness. There is no
sharp dividing line between them. Indeed some items such as vocabulary
are to be found either in intelligence or in English attainment tests.
Again some of the items in the Binet and Wechsler scales, and in such
well-known group tests as Army Alpha, are based on general knowledge
or information. Although such knowledge is obviously acquired, these
work very well as measures of intelligence, since the more intelligent
are more likely to pick it up, in or out of school, and to retain it.²
I.Q. and E.Q. always show very close overlapping, though some children
do come out markedly better on one than the other; and investigations
such as Burt's (1937) and Schonell's (1942) into educational
backwardness reveal that those with relatively low attainment are
usually handicapped by bad home conditions, poor health, ineffective
teaching, or emotional maladjustment, or a combination of these. Thus
even though we must admit that intelligence and the I.Q. are only
partly innately determined, it is still true that they give better
indications of capacity for acquiring further education than do
present attainments alone. Intelligence tests are more fair to
children from remote rural schools, and others whose schooling has
been affected, e.g. by absences, and to children of inferior
background; although all these factors themselves do have adverse
effects on the I.Q.

1 With repeated retestings variations are greater, and they may be increased
by numerous factors such as above average ability, Standard Deviations greater
than 15, inaccurate norms, or practice and coaching effects. They are much
larger, also, if the first test is given before the age of 6. Indeed so-called
developmental quotients obtained from tests of 0 to 2½-year-olds have virtually
no predictive value for later I.Q.
2 A more serious criticism is that girls and women tend to be considerably
poorer in general knowledge than boys and men; and it is preferable to use items
for intelligence tests which show no large sex differences.

The majority of tests employ the verbal medium since man's higher
thinking and reasoning capacities generally involve the use of words,
numbers, or other symbols. Moreover one of the most important, if not
the most important, use of intelligence tests is that of predicting
scholastic aptitude, and as school work is itself predominantly
verbal, verbal tests are certainly the most useful. But this means
that people's performance at such tests is considerably affected if
they have had inadequate opportunities for acquiring language, for
example the foreign-born or bilingual, the deaf, or children of
gipsies or canal bargees who receive little or no schooling. Many
testers therefore prefer non-verbal tests, consisting of problems
based on pictures, abstract diagrams or concrete (performance)
material. While these may be more fair to the verbally handicapped,
they cannot be considered as giving a truer measure of innate ability.
A British or American child, for example, has much more opportunity
than an African child for acquiring skills with pictures and blocks,
and a middle-class British child probably has advantages over a lower
working-class child. Nadel (1939) and Biesheuvel (1949) show
convincingly the fallacies of assuming that racial differences in
intelligence can be measured by such tests, or indeed that any type of
test is 'culture free.' Tests of the concrete or pictorial type are
valuable not only for the deaf and the illiterate, but also for
younger children who are less interested in verbal tests and less
accustomed to verbal tasks than older persons. But these, and other
non-verbal, tests give decidedly poorer predictions of educational
capacity, except possibly in the technical, scientific, and
mathematical fields.

Individual Tests.—Up to 1937, the majority of testers of individual
children used the Stanford Revision of the Binet-Simon scale,
published by Terman in 1916, or Burt's unpublished restandardization
of this Revision. The New Stanford Revision, or Terman-Merrill scale,
then became the most popular version (Terman and Merrill, 1937). This
is superior both because it contains very full instructions for
application and scoring, which help to obviate the confusions and
errors previously so common among inexperienced testers, and because
of its greater extensiveness. It is issued in two forms, L and M, both
of which cover almost the whole range of intelligence from that of
1½-year children to that of adults. At the same time it has its
difficulties and weaknesses:

(a) The original American versions cannot be applied as they stand in
this country, because they are replete with words and phrases that
require 'translation' into English. Some of these have been altered in
the edition published here, but many other desirable modifications
have been suggested. Of the latter, the simplest and the most
accessible are those prepared by the Scottish Council for Research in
Education (Kennedy-Fraser, 1945). Again, the typical responses quoted
in the scoring instructions are of little use since, even after
translation, they represent the thoughts and expressions of American
children. The tester must try to decide whether his testees' responses
correspond in intellectual level to these acceptable and unacceptable
responses.

(b) None of the British versions has been restandardized. Thus the
tester must expect to find that many of the tests are misplaced in
order of difficulty. Children may fail all the tests in one year, and
yet pass one or two in a higher year. The same applies to items within
a test. Thus the Vocabulary words and some of the Absurdities, etc.,
require rearrangement.

(c) It is difficult to obtain toys and other practical material,
needed for tests of younger children, which are precisely comparable
to the originals, and this material is very expensive. Fortunately it
is used only at Mental Ages of 4 and less. For almost all children of
primary school age, the printed cards and beads and blocks are
sufficient.

(d) Most of the tests near the top end of the scale, labelled Average
or Superior Adult (M.A.s 15-22), are too childish in content to be
suitable for adults, although they are useful for discriminating among
children of above average and very superior intelligence.

(e) Despite the care taken over the American standardization, the
tests from about 11 years onwards appear to be considerably too easy,
and testers complain that far too many adolescents are getting high
I.Q.s in the 130-160 range. It may be that these queer results arise
merely from variations in the Standard Deviation of I.Q.s at different
age levels (cf. p. 71); and Fraser Roberts and Mellone (1952) supply a
useful correction table which allows for such variations. But, as
pointed out earlier (p. 73), no test based on M.A. scoring can be
expected to yield really satisfactory results.

The Wechsler intelligence scales surmount several of these
difficulties. The original Wechsler-Bellevue test (Wechsler, 1939) is
now replaced by W.A.I.S. (Adult Intelligence Scale), which has norms
from 17 to 75 years.¹ The W.I.S.C. (Intelligence Scale for Children)
is applicable from 5 to 15+ years, but is not, like Terman-Merrill,
suitable for pre-school children. Each scale includes 5 or 6 verbal
sub-tests and 5 performance tests, and thus yields a verbal and a
performance, as well as a combined, I.Q. Here too some adaptations are
needed for British testees, but standard modifications have been
prepared by the British Psychological Society. The pictures, blocks,
etc., needed for the performance tests are unavoidably expensive, but
the standardization and scoring are so much superior to those of any
other scale that most psychologists concerned with individual testing
of children and, especially, of adults prefer the Wechsler.

¹ The original version may, however, continue to be used for some time
in this country, partly because more thorough English adaptations have
been worked out, and partly because it normally takes only 45 minutes
to apply as compared with over 60 for W.A.I.S.

Valentine's Intelligence Tests for Young Children (1948) are
particularly suitable for 2- to 5-year-olds, though also providing a
short battery of individual tests for children up to about 11 years.
This scale has the advantage that all the necessary materials are
included in the handbook, or can easily be constructed by the tester.
For 0-2-year-olds, Griffiths' (1954) Mental Development Scale is the
most thorough, and the best standardized in Britain. It yields
separate quotients in each of five fields of development (Locomotor,
Personal-social, Hearing-speech, Eye and Hand, Performance) and a
combined or general quotient. In testing at this level it is
especially important that the tester be thoroughly trained and
experienced in managing babies and using the correct materials and
instructions.

Performance Tests.—As we have seen, these practical tests may be used
when verbal ones are inapplicable, or they may usefully supplement the
predominantly verbal Binet-type test. Their chief merits are their
greater attractiveness to children, and their capacity for bringing
out a variety of temperamental reactions. A testee's manner of
approach to such tests, and his behaviour when confronted with
difficulties, throws much light on his impulsiveness, persistence,
complacency, and other qualities, which cannot as yet readily be
tested or measured objectively.

The majority of performance tests suffer from the practical drawbacks
that they are very unwieldy and difficult to transport, and extremely
costly. Some testers are suspicious of them because different tests
often give very inconsistent results. A child with a Binet M.A. of 9
years may, for instance, pass performance tests at anywhere from the
7- to the 12-year levels. This is partly due to poor standardization,
and to variations in the actual materials and methods of application.
As they have to be applied individually, the task of providing
accurate British age norms is a difficult one (cf. Vernon, 1937). Even
if they were perfectly standardized, we should still expect
considerable inconsistencies, since their inter-correlations show that
each test involves a large specific factor, or unknown group factors.
Many of them are also poor in reliability, and the evaluation of this
feature is difficult because of the large practice effects if they are
applied twice to the same testees. Thus the mental tester should never
place much reliance on a single performance test, nor on two or three.
A small number may be sufficient to indicate whether a Binet I.Q. is
likely to have been distorted by some verbal defect, but the median or
mean score from at least half a dozen, or the total result from a
battery such as Drever-Collins, W.A.I.S. or W.I.S.C. should be taken
if a reliable performance test M.A. or I.Q. is desired.

Undoubtedly then their validity, either as measures of g or of
'practical ability,' is poor. True, we have seen that they may involve
the spatial ability factor, k, to some extent, and so be useful in
selection for mechanical occupations and technical courses. But there
is no real justification for the view that they give a better
indication of intelligence for daily life purposes than the more
reliable verbal type of test. It is interesting to note that several
of them are markedly affected by emotional maladjustment and
neuroticism. Children tested at a Psychological Clinic tend to make
poorer scores than on Binet. Delinquents, on the other hand, who are
often very backward educationally, do tend to do comparatively well on
performance tests (cf. Earl, 1939).


* 


GROUP TESTS AND INDIVIDUAL TESTS

The merits and defects both of individual tests such as
Terman-Merrill, Wechsler and performance tests, and of group verbal or
non-verbal tests, may best be brought out by contrasting them under
the following series of headings:

1. Age Range.—Individual tests must be used with very young children
because it is practically impossible to hold the attention of a group,
or to get them all to act alike at the same moment. Further, it is not
natural for children to apply their intelligence to symbols on paper
(words, pictures or diagrams) until they have settled down to formal
schooling. Manipulation of concrete objects or individual conversation
with the tester constitute much more appropriate testing media. Group
tests are available which claim to measure Mental Ages down to 6 or
even 5 years. The instructions for these must, of course, be given
orally. But their value is very doubtful, for the above reasons. They
may sometimes be useful for classifying pupils aged 7+, i.e. children
whose M.A.s are likely to run from 5 to 9.

By 9 or, better, 10 years, children can read instructions for
themselves, are thoroughly accustomed to class work, and can readily
deal with a variety of verbal problems. This is therefore a very
suitable age at which to begin group testing. Above 14, group tests
have definite advantages over individual, both because they can
readily be made difficult enough even for the highest intelligence
levels, and because their items are much less apt to appear silly and
childish to suspicious adolescents or adults. Nevertheless, the
W.A.I.S. or W.I.S.C. are preferable for abnormal cases who would not
settle down readily to the group test situation, e.g. the
feeble-minded or patients in a mental hospital or clinic.

2. Ease of Application.—Group tests are easier to procure, to apply
and to score, and they demand less training and experience on the part
of the tester. Possibly, however, these are disadvantages. We assume
too readily that group test performance is unaffected by the tester
and the atmosphere he creates. As shown in the next chapter, there are
many errors of application and interpretation that inexperienced
testers often commit.

3. Time.—Needless to say, the time saved by applying group instead of
individual tests is enormous. For instance, a class of forty can
usually be tested in less than an hour, and their answers scored in
four to eight hours, whereas an experienced Binet tester needs about
half an hour for each young child, three-quarters for an older one,
and an inexperienced tester much longer still. The point, however, is
whether this period per child is worthwhile. The majority of
psychologists would certainly say that it is, and that an individual
test can reveal much better than a group test things which the child's
teachers and parents may have failed to recognize even after years of
acquaintanceship. Probably the best compromise would be for a school
psychologist or infants' mistress to test each child individually at
6, for a group non-verbal or oral test to be given in class at 8, and
verbal group tests at 10 and 14 (assuming that the results of
selection tests are available at 11+). Such results would be entered
on a cumulative record card, and would provide a very reliable picture
of each child's intellectual development. Individual tests would need
to be given from 7 onwards only to exceptional pupils whose group test
results were highly irregular, or about whom some important decision
had to be made for purposes of educational or vocational guidance.

4. Time Limits.—Most of the items in the Binet scale and many
individual performance or educational tests are untimed, whereas the
great majority of group tests have to be done under speeded
conditions. Are not, the critics ask, some children slower but surer
than those who score well on these timed tests? As indicated
previously (p. 132), the answer is more negative than positive, but
the question is a complex one. If time allowances on speeded tests are
increased, slower children certainly improve their scores, but they
still generally remain poorer than quick ones. The rank order is but
little affected and (according to the principles of Chaps. II and IV)
it is relative, not absolute, scores that matter. At the same time
experiments do show that with highly speeded and rather easy tests,
children's performance is considerably affected by their quickness and
by whether they aim primarily at speed or at accuracy; whereas with
very difficult, untimed material their persistence or willingness to
go on trying may affect their results as much as does their ability.
For most purposes it is desirable to compromise between these
extremes, and to avoid mixing up our measures of ability with what are
essentially temperamental or work attitude factors. Further, it would
be administratively most inconvenient to use untimed tests, which
different testees would take very different times to complete.

Hence it is possible that some of our present group intelligence and
attainments tests (particularly in arithmetic) are unduly speeded. If
we inclined towards tests with longer time limits and more difficult
items, very few children would alter their relative scores
appreciably, but it is likely that our predictions of future
educational achievement would be slightly improved. We know, too, that
emphasis on speed unduly handicaps older persons, say 30 years
upwards, hence ample time limits are preferable among adults. It may
be that it also upsets emotionally excitable or nervous children, and
this is one reason why Clinic psychologists prefer to use individual
tests where the testee gives his answer at his own pace.

5. Objectivity.—With good group tests the conditions of application
and the scoring are much better standardized than with most individual
tests. Very little is left to the personal decision of the tester, and
the marking is, or should be, completely foolproof. In actual practice
many group testers still manage to commit gross errors, some of which
will be pointed out in the next chapter. And modern individual tests
like the Terman-Merrill and Wechsler afford much less scope for
subjective whims than did the older ones. Nevertheless individual
testers often have to judge the correctness of doubtful responses, and
may, without departing from the instructions in the handbook, alter
the testing conditions in a variety of ways. Yet there seems to be no
evidence of serious discrepancies between the results of trained
testers, although untrained ones may make grave mistakes, even
sometimes leading to the wrongful certification of mentally defective
children or adults. Reliability coefficients over short periods are
very high, though greater difficulties leading to lowered reliability
occur in testing children below the age of 5, and among maladjusted
children or adults (cf. Tulchin, 1934). The experienced tester should,
however, be able to recognize when his results are liable to be
affected by extreme unco-operativeness or instability in the testee,
and to treat them with due caution or to arrange for a re-test when
desirable.

Actually the rigid objectivity of the group test is distrusted by
Clinic psychologists, since they know that the same physical situation
may mean very different things to different children. The flexibility
of individual testing is to them a great advantage which, so far as is
known, they do not abuse.

6. Dependence on Health and Mood.—Both group and individual tests seem
to be much less affected by these factors than might be anticipated.
Experiments show little and often no difference between the scores of
children when they are fatigued or ill, and when in good health, nor
between those who are strongly motivated (e.g. by the promise of
monetary rewards) and those who are merely encouraged in the ordinary
way. Studies of the latter type sometimes show that testees who are
making more effort may attempt more test items, but they also make
more mistakes so that their scores are scarcely altered. There seems
then to be little evidence for such statements as that the examination
conditions of group testing make some testees unduly nervous, or that
others are stimulated to do better by the competitive nature of the
testing situation. All such conclusions, however, are based on average
tendencies, which admit occasional individual exceptions, and Binet
testing certainly indicates that a few testees are so resistant or
timid that they fail to do themselves justice. More extreme
differences in attitude to the test can undoubtedly distort the
scores, and certain physical conditions such as bad eyesight must have
a like effect. No intelligence test possesses perfect reliability; and
conditions such as these may be at least partly responsible.
Variations in group test performance, as well as in individual, are
known to be greater among maladjusted children. Yet at the same time
there is no evidence that excitement or anxiety upsets the results of
11+ selection tests among more than a small minority.

The great advantage of individual over group testing is that in the
former the tester can generally discern the influence of abnormal
conditions, whereas in the latter he has scarcely any control over, or
insight into, them. It may be that while high scores on group tests
are relatively stable and significant, low scores can arise from a
variety of sources other than poor intelligence. However, a lot more
careful research is needed in order to determine definitely the part
played by such factors.

7. Dependence on Practice or Coaching.—Most intelligence tests are
published, hence anyone can procure them, coach testees on them, and
ensure considerably increased scores. It is a most foolish thing to
do, since a child who is made to appear unduly intelligent may also
legitimately be expected to be capable of unduly difficult school
work. But whenever examinations for secondary school entrance, or for
other competitive purposes, include intelligence tests, teachers and
parents who wish the children to do well will naturally be tempted. In
individual testing it is fairly easy to recognize when children have
been coached or practised, since their glib answers are very different
from the answers of children who are puzzling out the tests for the
first time. It is less easy to know what allowance to make. In group
testing, on the other hand, one may suspect that scores are too high
all round, though one can never be certain unless (as has actually
occurred) a whole class answers 19 test items out of 20 correctly, the
twentieth being one where the teacher was mistaken.

In order to prevent direct coaching, most Educational Authorities
obtain tests which are freshly constructed and privately standardized
every year by Moray House or other organizations. But copies of older,
more or less parallel, versions are readily available, and there are
large sales of booklets which contain test items very similar to those
in the unpublished tests. Particularly at the 11+ stage, and in areas
where the provision of grammar school places falls far below the
demand, intensive coaching in schools, at home, or by outside tutors,
is known to be widespread. Not only does this tend to distort the
normal school curriculum (cf. p. 199), but also it produces severe
strain and anxiety among many of the pupils.

The topic has received considerable publicity, and numerous
investigations have been carried out. These show that coaching and
practice on intelligence test materials do make an appreciable
difference at the selection borderline, though they are nothing like
as effective as parents and many teachers imagine. Here we should
distinguish practice in taking complete intelligence tests, without
further instruction or knowledge of the right answers, from coaching,
where items are explained and hints given. A single previous practice
test raises the average I.Q. by about 5 points, and further practices
yield diminishing returns up to a total of about 10 points. After some
five tests there is no further improvement, and I.Q.s fluctuate
irregularly or even begin to decline. When combined with coaching, the
effects are somewhat larger, the average maximum rise being 12 to 15
points. But these figures refer to children or adults who have never
seen a group intelligence test before. When, as is usually the case
nowadays, they are fairly familiar with tests, the effects may be only
half as great. Moreover it has been shown that the effects are highly
specific. Thus the kind of coaching which is given from published
books of miscellaneous items, and which does not include the taking of
a complete parallel test under examination conditions, is remarkably
ineffective, producing barely a 5 points rise. Again, the maximum
improvement occurs remarkably quickly; if practice or coaching amount
to more than a few hours they seem to defeat their own ends.

We have admitted earlier that good schooling and home background do
genuinely stimulate the development of intelligence. But it is clear,
from the facts just stated, that intelligence cannot be 'taught' in at
all the same sense as school subjects are taught. Practice and
coaching merely provide a limited facility in dealing with the rather
roundabout and artificial types of items that occur in most group
tests, in understanding and following the instructions quickly, and in
the concentration of effort. Thus they have distinctly less effect on
the more natural types of items used in individual tests, where the
testee supplies his own answer in his own words, than they do on most
multiple-choice or recognition-type group test items.

In areas where coaching is at all common, the only fair solution seems
to be to allow a limited amount of practice and instruction on
parallel tests in the schools. This will give uncoached children a
reasonable familiarity with the sort of tests they are to take, and
will generally bring them to the level of the more intensively
coached. But it would be better still if tests were used primarily for
guidance rather than for selection, since then the incentive to coach
would be removed. At the same time, the mental tester can never afford
to ignore practice or coaching; the amount and kind of previous
experience of tests that testees have had does make a difference. For
example, in any experimental investigation involving the comparison of
two or more tests, the scores on those taken later will be to some
extent affected. The educational psychologist should realize that
published test norms may be inapplicable to children who are unduly
'test-sophisticated,' and should usually avoid giving tests in any
school at intervals of less than, say, two years.

8. Validity.—The various versions of the Binet test, as already
mentioned, measure a poorly analysed hotch-potch of abilities, and
they have been severely criticized by Spearman, Cattell and others on
account of the weaknesses of their statistical construction and
theoretical foundations. Burt (1939b) likewise points out the
heterogeneity of item content. Perhaps still more serious are the
variations in content at different age levels; for example, the verbal
bias in Terman-Merrill becomes much more pronounced at the top end.
Thus the scale cannot be said to be measuring the same intelligence,
or even the same pattern of abilities, throughout. A further
consequence of this heterogeneity is that it tempts testers to make
dubious subjective analyses of types of items, as when they note that
a child does especially well on the more practical items, or badly on
the memory items, and jump to the conclusion that he is a 'practical'
child in daily life, or has poor 'memory' for school work. The
experienced clinical psychologist can obtain many suggestive hints as
to qualitative features of the child's intellect and temperament from
his responses to particular items, or groups of items, but they are
only hints which require to be followed up in far greater detail
before they can be regarded as of any diagnostic value. Qua test, it
only yields the one general score, which can be regarded as rather a
vague mixture of g, v:ed, and other factors.

In contrast, the well-constructed group test is highly homogeneous,
each of its sub-tests or types of items involving much the same factor
pattern. Spearman would have claimed that it was an almost pure
measure of g, though we realize now that the content and form of the
items, together with difficulty level and timing conditions, do
introduce other factors. It seems paradoxical, then, that most
psychologists prefer the relatively unscientific Terman-Merrill test
when they are called on to make any decision involving the
intelligence of an individual child. They regard it as their most
valid mental measuring instrument, and as giving the best available
index of a child's general intelligence. That less trust can be put in
group test scores is shown by the fact that different group tests
often give somewhat discrepant results. If they are fairly similar in
make-up the correlation between them (over a few days or weeks) is
likely to be between 0.8 and 0.9 in a representative group. But with
greater diversity, as between verbal and non-verbal tests, the
correlation may sink to 0.7 or below, and discrepancies between a
child's two I.Q.s may reach 30 points or more, though the average or
median difference should still not exceed about 7 points. At the same
time, group verbal tests actually give at least as good predictions of
future educational standing as do the Terman-Merrill or Wechsler,
among normal children aged 10 upwards and adults. The superior
validity of individual tests is thus partly illusory, though it
probably does hold good for younger children, particularly in regard
to intelligence outside the school.
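
To see how the median discrepancy between two tests grows as their
correlation falls from 0.9 to 0.7, one can tabulate the Probable Error
of estimate across that range. This is a rough sketch under stated
assumptions rather than the author's own working: it takes the
standard formula PE = 0.6745 × SD × sqrt(1 − r²) and an I.Q. Standard
Deviation of 15, and the names used are hypothetical. On these
assumptions the figure rises from between 4 and 5 points at 0.9 to a
little over 7 points at 0.7, in line with the limit of about 7 points
mentioned above.

```python
import math

SD_IQ = 15.0  # assumed Standard Deviation of I.Q.s on both tests

def probable_error(r, sd=SD_IQ):
    """Median absolute error in estimating one I.Q. from the other (assumed formula)."""
    return 0.6745 * sd * math.sqrt(1.0 - r * r)

# Correlations mentioned in the text: similar tests (0.9, 0.8), diverse tests (0.7).
pe_by_r = {r: probable_error(r) for r in (0.9, 0.8, 0.7)}

for r, pe in pe_by_r.items():
    print(f"r = {r:.1f}  ->  median discrepancy about {pe:.1f} I.Q. points")
```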

The probable explanation of this situation is that the intelligence
which we recognize in daily life, in school or out of it, is not pure
g, but a mixture of abilities. Binet and his followers chose items
largely from children's everyday behaviour and thinking, regardless of
their factor content, and the very impurities that they included
happen to overlap very well with the mixture we want. In the group
test, however, the items are mostly of a more restricted and
artificial type, though they happen to coincide fairly closely with
the kind of intelligence required for passing examinations. The child
is forced to express his abilities through a less natural medium, and
this introduces such components as
facility-in-doing-multiple-choice-items-at-speed, which are irrelevant
to anything we want to predict.

A further consideration, especially among younger children and among
the more unstable or abnormal testees of any age, is that the group
test yields only a numerical score, with none of the additional
indications which are so useful in interpreting individual test
results. The Binet tester has much more effective control over the
testee's state of mind. He ensures favourable co-operation, and can
judge, albeit subjectively, whether his responses are typical of the
best that he can do. Despite the small effects of mood, health, etc.,
on group test scores, there are likely to be many uncontrolled factors
which upset these scores in individual instances, and which partly
account for the imperfect correlations between different tests. It is
artificial, again, to separate off intelligence as a purely cognitive
variable from the rest of the personality. Normally it always operates
in a certain emotional and social context. The group tester tries to
ignore this; but the individual tester, by observing the child's
manner of approach to difficulties (particularly those offered by
performance tests), from the kind of test response and from general
conversation, can build up a picture of the child's whole personality,
and see how that personality applies his available intelligence. Thus
he obtains a much more complete and practically helpful view than can
be got from a more scientific measure of ability factors, though
admittedly this view is liable to subjective distortions.
The present Writer ayould hold that the most, reliable {esting of 
individuals must be done individually. But instead of one scale of 
somewhat indeterminate and variable contefit, he would prefer to 
see developed'a series of Shorter ahd more distinctive scales or sub- 

‚ tests. Each sub-test should be shown, by factorial studies, to measure E 
as homogeneous ап ability as possible, if necessary over a limited age 
range only. Such abilities need mot be pure factors. ‘Indeed over- 


lapping is unavoidable, ‘as they would all involve g, and possibly 
x LJ 
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other factors, in common. But they should cover as wide a range of 
mental functions as possible, and so enable us to carry out more 


e penetrating and more scientific dfiferential diagnoses. Instead of 


© 


ee 


» 
б 


merely summing the scores to provide 2 measure of gegeral intelli- 
gence, we should coniibine the results of several scales with appro- 
priate weights to measure, say, intelligence ineveryday life situations; 
a.djfferent combination would give aptitude for academic ed ucation, 
another for technical education, yet another for mental deterioration 
in different types of psy¢hotic and brain-injured patients, and so on. 
Such weightings would be based on actual follow-up research, not 


ой subjective judgments of what the scales measured, The Wechsler ; 


tests already"go some way in«this direction, but their sub-tests tend 
tovoyerlap too highly, and they are mostly too unreliáble to yield 
useful differentialdiagnoses. They give a good verbal 1.0. and a 
total (verbal +-"performance) I.Q.; but attempts to use patterns of 
scores, ог weighted scores, on the different sub-tests for revealing 
various types of mental abnormality, seem to have broken down. А 
number of batteries of so-called differential aptitude tests have been 


• published in America, descriptions of which are given by Anastasi 


(1954). Though some of these give promise of being more useful in 
vocational guidance than a single general test, they too’suffer from 
insufficient reliability and too great overlapping. Thus the develop- 


. ment of a really satisfactory set of tests, beyond the stage that 


» psychologists and teachers. These and other tests from such countries , 


‚ as NFER). 


. 


Wechsler has reached, is by по means easy. 
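
The idea of combining sub-test results with different weights for
different purposes can be sketched as follows. This is only an
illustration: the sub-test names, the scores and the weights are all
hypothetical, and in practice the weights would have to be estimated
from follow-up research, as the text insists, not chosen by inspection.

```python
# Hypothetical standardized sub-test scores (mean 0, SD 1) for one testee.
scores = {"vocabulary": 0.8, "number": -0.2, "spatial": 1.1, "memory": 0.3}

# Hypothetical weights for two different predictions. Real weights would
# come from follow-up data (e.g. regression on a later criterion).
weights = {
    "academic": {"vocabulary": 0.5, "number": 0.3, "spatial": 0.1, "memory": 0.1},
    "technical": {"vocabulary": 0.1, "number": 0.3, "spatial": 0.5, "memory": 0.1},
}

def composite(sub_scores, purpose_weights):
    """Weighted sum of sub-test scores for one purpose."""
    return sum(w * sub_scores[name] for name, w in purpose_weights.items())

profile = {purpose: round(composite(scores, w), 3) for purpose, w in weights.items()}
print(profile)
```

The same set of sub-test scores thus yields a different composite for
each purpose, which is the point of weighting the scales instead of
merely summing them.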
LISTS OF TESTS AND METHODS OF MEASUREMENT
IN EDUCATIONAL PSYCHOLOGY

The following list includes most of the mental tests available in
Britain at the time of writing, together with some particulars of age
ranges, time requirements, publishers, prices, etc. A number of older
tests, little used nowadays or out of print, are excluded; so also are
most tests whose sales are restricted to Directors of Education or
other specially qualified persons. A few tests published in Australia
or in U.S.A. are included which seem likely to be useful to British
psychologists and teachers. These and other tests from such countries
can generally be ordered through the National Foundation for
Educational Research in England and Wales (referred to in the list
as NFER).

Publishers.—Publishers' names are abbreviated as follows:

AE    Australian Council for Educational Research
B     Baird Scientific Instrument Co., Edinburgh
CL    Crosby Lockwood
G     Gibson, Glasgow
H     Harrap
ICP   Institute of Child Psychology
IPAT  Institute for Personality and Ability Testing¹
L     H. K. Lewis, London
LG    Longmans, Green
M     Methuens
NFER  National Foundation for Educational Research
NI    National Institute of Industrial Psychology
N     Nelson
OB    Oliver and Boyd, Edinburgh
PC    Psychological Corporation, New York
SRA   Science Research Associates, Chicago
U     University of London Press


Prices.—The prices quoted are those current in 1955 for 100 copies
(including purchase tax), unless otherwise indicated. Naturally these
are liable to fluctuation. Group tests are often available in dozens or
sets of 25, at slightly increased relative cost. Single specimen copies
with manuals are usually issued at about 2/- per test. An asterisk (*)
in the prices column indicates that the answers can be written on
blank paper or special answer sheets, or given orally, so that the
printed blanks can be used over again. Where no price is quoted,
the necessary material is given in the book referred to.

Times.—When a definite figure for time limit is given, this usually
means the actual working time, apart from instructions, etc. An
approximate figure, e.g. 20-30, or c. 25, shows the number of minutes
including instructions, or indicates that the time varies with
different testees.

¹ Agent, Mrs. M. M. Barnes, The Grammar School, Henley-on-Thames.

Age Ranges.—These are given in two columns headed C.A. and M.A. The
former indicates that percentiles, standard scores or other norms are
available for all children in this range, whereas the latter implies
that mean (or median) test scores only are listed for certain ages.
Thus a test with C.A. 10-12 covers a wider range than one with
M.A. 8-14½, since the latter will not discriminate below I.Q. 80 at
10.0 years, nor above I.Q. 120 at 12.0 years.
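
The arithmetic behind this comparison can be sketched as below. The
sketch assumes the ratio definition, I.Q. = 100 x M.A. / C.A., and reads
the upper limit of the M.A. norms as 14½ years; both the function name
and that reading of the figure are assumptions made for illustration.

```python
def ratio_iq(ma_years, ca_years):
    """Ratio I.Q.: 100 * mental age / chronological age."""
    return 100.0 * ma_years / ca_years

# Lowest I.Q. the M.A. norms can register for a child aged 10.0
floor_at_10 = ratio_iq(8.0, 10.0)
# Highest I.Q. the M.A. norms can register for a child aged 12.0
ceiling_at_12 = ratio_iq(14.5, 12.0)

print(f"floor at C.A. 10: {floor_at_10:.1f}")
print(f"ceiling at C.A. 12: {ceiling_at_12:.1f}")
```

A 10-year-old duller than the floor, or a 12-year-old brighter than the
ceiling, simply falls off the end of the M.A. norms, whereas norms given
for C.A. 10-12 place every child of those ages.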

References to books or articles where descriptions of, or norms for,
the test may be found are generally shown, by date, after the author's
name. Some additional references and notes are given in small type.
Readers who require a detailed critical evaluation of any American
published test should consult the latest edition of Buros's (1953)
Mental Measurements Yearbook. This, however, covers only a proportion
of British and Australian tests.

1. Choice of a Test.—In choosing a test for some particular purpose,
the tester's first consideration should, of course, be its validity for
that purpose. As we have seen in Chap. IX, he may be able to judge its
validity to some extent by inspection of its content, but he would be
better advised to seek out experimental proof of what it is that the
test measures. Investigations by its author, or by others who have
used the test, should be consulted. The mere fact that it is called a
test of such-and-such an ability does not constitute evidence. The
second consideration (which greatly affects the first) is the test's
reliability. Generally speaking, a test must be fairly long, and its
scoring must be objective, if it is to be satisfactory in this respect.
Most authors of good tests publish evidence of their reliability in the
accompanying manuals. Little trust should be placed in tests whose
scores are not known to yield a reliability coefficient of 0.90 or over
in a representative group, say, a complete year-group of children.

2. Age Range.—Few, if any, tests presume to cover all ranges of
ability, and most educational and intelligence tests are only intended
to cover quite a limited range. Thus the tester should be careful to
choose from among the available tests those which will, when applied
to a large group of his testees, be likely to yield a normal
distribution of scores, and which will not unduly curtail this
distribution at either end. For instance, in testing an ordinary school
class, he should first estimate roughly the range of M.A.s or E.A.s
which it contains. A class of average C.A. 10 years may often include
individuals with M.A.s (on a group test) of 6 to 14 years, though the
great majority should lie between 7½ and 12½ years. He should then
study the ages for which the various tests provide norms, and pick one
which covers as much of this predicted range as possible. If he has
reason to believe that the average M.A. or E.A. of the class will be,
perhaps, a year above or below its average C.A., he must
correspondingly adjust his choice of test.

The ‘differentiating power’ of the test norms should also be noted.
They should show a fairly large and even increase in score from one
age level to the next. Suppose that a test containing, say, 200 items
has the following norms from 6 to 13 years.

Years    6    7    8    9   10   11   12   13
Score   40   71  101  128  151  168  179  184

The increases are large up to 9 or 10, but then tail off, and there is
a difference of only 5 items between the 12- and 13-year scores. Such a
test will clearly not discriminate or differentiate well at the upper
end, and its effective range of norms is from 6 to 11 rather than from
6 to 13.

Similar considerations apply to the choice of tests whose norms are
issued in percentile, or other, forms.

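
The yearly increases in the norms quoted above can be tabulated in a
few lines. The scores are those of the example; the cut-off of 15 items
used to flag a weak step is purely hypothetical, chosen only to show
the kind of check a tester might make.

```python
ages = [6, 7, 8, 9, 10, 11, 12, 13]
norms = [40, 71, 101, 128, 151, 168, 179, 184]

# Gain in raw score from each age level to the next.
gains = [later - earlier for earlier, later in zip(norms, norms[1:])]

# Hypothetical cut-off: treat a yearly gain under 15 items as too small
# to separate adjacent age groups well.
MIN_GAIN = 15
for age, gain in zip(ages[1:], gains):
    flag = "" if gain >= MIN_GAIN else "  <- weak differentiation"
    print(f"{age - 1} to {age}: +{gain}{flag}")
```

The gains fall steadily from about 30 items a year at the bottom of the
range to 5 items between 12 and 13, which is the tailing-off described
in the text.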
3. Convenience of Arrangement and Scoring.—If a test is to be used
extensively, attention should be paid to its general arrangement from
the point of view of ease in applying and scoring. With some
performance tests and tests of special aptitudes, much time is wasted
in setting out the material, or the instructions are very elaborate, or
the scoring unnecessarily detailed. More modern American tests usually
try to reduce these inconveniences.

Group tests of attainment or intelligence may also vary greatly in
‘mechanics’ or ‘lay-out.’ Some of the older ones have insufficient
instructions and sample items or practice sheets, so that testees often
fail to understand what they have to do, and a lot of additional oral
explanation is needed. Sometimes the testees have to answer in their
own words, and the scoring manual never allows for all the possible
answers which turn up, so that much time is wasted in deciding
whether or not an answer shall be counted. The positions of the
answers are often scattered all over the page, instead of being
arranged in the right-hand margin. True, stencils are often provided
which, when laid over the test page, show up the correct answers. But,
according to the present writer's experience, it is far more trouble to
fit the stencils over each successive page than it is to compare the
testee's booklet with a booklet on which all the correct answers are
clearly marked.

Many American group tests are now issued with an interleaved
carbon-paper device, so arranged that all the correct answers, and
none of the incorrect ones, are automatically shown as crosses on the
back of the page. The scorer then merely has to count the number of
crosses. Alternatively, testees may check their answers with a special
carbon pencil; their blanks are then passed through an electrical
machine which counts instantaneously the number of carbon marks in
correct positions, through the amount of current which passes. In this
country, however, testing is seldom carried out on so large a scale as
to warrant such ingenuity, and its consequent expense.

4. Procedure.—Many of the following hints may seem puerile to some
readers, yet they are all based on the writer's experience of mistakes
actually committed by amateur testers. Such mistakes may often lead to
quite serious consequences.

The tester should first study the instructions and the test material
carefully with a view to seeing precisely how the test works, and
should then follow these instructions to the letter. Often it is worth
the tester's while to take the test himself. In Binet testing, the
results should not be trusted until such time as the tester practically
knows the instructions, and the essentials of scoring, by heart, and
until he has given the test to, say, twenty cases.

5. Modification of Instructions or Materials.—The inexperienced
tester is very apt to find flaws in the test or instructions which, he
thinks, could be eliminated by some simple modifications. Very likely
the test could be improved, but then it would no longer be the same
test, and the published norms would no longer hold good. For the same
reason a printed group test must not be duplicated or mimeographed in
order to save expense. Quite apart from the probable infringement of
copyright, the difficulty of the test is likely to be altered thereby.

6. Timing.—Accurate timing is essential. In some group tests a
mistake of 5 seconds may quite possibly lead to an error of 1% or 2%
in average I.Q. Testers who do not own stop-watches are advised to use
omnibus tests when feasible, since these require only a single timing.
If times are kept by a watch with a seconds hand, the exact minute and
second at which a test is to stop should be written down as soon as the
test has started. Mistakes very easily occur if the tester trusts to
his memory.

7. Underlining and Erasing.—Before giving a group test to a class,
all rulers and indiarubbers should be firmly removed, otherwise the
pupils will waste several minutes in ruling their underlinings, or
rubbing out responses when they change their minds. If the test
instructions do not already say so, the tester should explain to the
pupils that, in order to change a response, they should scribble out
the first choice rapidly, and then mark or underline the new choice
clearly.

8. Distractions and Copying.—In all testing, possible sources of
distraction should be avoided. For instance, a group test should not
be given when a singing lesson is due to occur in the next classroom.
The testees should be spaced as far apart as possible, since it is very
easy, in multiple-choice tests, for one to overlook and copy the
answers of another. Often the younger children have no intention of
cheating, but are yet extremely anxious to write down the same as
their neighbours, simply because the whole situation is unfamiliar to
them, and they are afraid of doing the ‘wrong thing.’

9. Practice Sheets and Samples.—In most group tests for children
there are preliminary practice sheets or sample items, where help
may be given. The tester should make sure that even the dullest
testees understand what to do, and that they can manage at least one
or two of the sample items by themselves.

10. Coaching.—No practice or coaching of any kind, beyond what is
specified, should be given. Testees who gain an advantage by such
means will no longer be comparable with those on whom the test was
standardized, and the norms will be useless (cf. pp. 165-7).

11. The ‘Subjective’ Situation.—At the same time as adhering to
the standard objective conditions of testing, the tester should try to
obtain, even in a group test, a favourable subjective atmosphere. If
the tester himself is apprehensive, or bored, his mood is likely to
communicate itself to a group of children. The testees should not be
excited and anxious, nor, of course, lackadaisical, in their attitude
to the test. Teachers should not peer over children's shoulders while
they are working to see how they are getting on; it inevitably upsets
them. Still less should they criticise, or help, children with wrong
answers.

12. Testing and Teaching.—Many teachers make bad testers because the
teaching attitude is so ingrained in them that they cannot refrain
from drawing morals, or pointing out mistakes. A test is not a kind of
lesson, and children should not be expected to learn anything
whatsoever from it. Rather it is a matter of stocktaking. The essential
object is to obtain a record of the testees' performances under
certain standard conditions, so that these performances may be
compared with those of other similar testees who have been tested
under the same conditions.

13. Scoring.—The scoring instructions must be followed as
meticulously as the instructions for applying the test. The tester may
sometimes disagree with the response provided in the scoring key, and
think that some alternative is just as good or better. This may be so,
but once personal opinion enters the scoring ceases to be objective,
and the results again cannot be compared with the test norms. The
scoring of most group tests is quite mechanical, but errors very
readily creep in, both in checking the right responses and in adding
up correct checks. It is extremely desirable for such tests to be
scored twice, preferably by different persons.

14. The Subjective Situation in Individual Testing.—The conditions
of individual testing require still more tact and sympathy than those
of group testing. By means of preliminary conversation the child
should be put at ease, and interspersed irrelevancies should keep him
interested and responsive throughout. If he shows signs of fatigue,
distraction, or resistance the testing should be broken off. The good
tester develops a kind of ‘bedside manner,’ and avoids formalities.
He does not read the oral tests from the handbook, but, knowing them
by heart, says them in a conversational tone. In order to keep the
child interested he must himself sound interesting. Encouragement
should follow every response, but some caution is needed in praising
incorrect responses, since many children may realize when they are
wrong and become careless or discouraged if there is no criticism nor
stimulus to do better. The presence of a third person, particularly
the mother or head teacher, constitutes a distraction which should
always be avoided.

In giving the older versions of the Binet scale, it was the common
practice among trained testers to jump about from one part of the
scale to another. This enabled the tester to choose the test which
seemed to him to be most appropriate to the child's mood at the
moment, also to save time by grouping together tests with identical
instructions (e.g. the digit memory tests); and it prevented all the
most difficult tests from coming at the end of the session. This
procedure has been forbidden by Terman and Merrill (1937). But Hutt
(1947) shows that it makes no difference to the I.Q.s of normal
children and is, if anything, more fair to unstable children. The
Terman-Merrill handbook gives several other useful hints on obtaining
good rapport, while at the same time adhering strictly to the
standard instructions and scoring.

15. Record Keeping.—The following principle is commonly adopted by
experienced testers: never spend any of the child's time in making
records of his responses. Notes should certainly be made, not only of
test responses, but also of manner, speech, etc. All the recording,
however, should be done while the child is engaged either in doing a
test, in irrelevant conversation, or in listening to the instructions
for another test; and it should be so unobtrusive that it never
constitutes a distraction. Poor testers not only waste time in
recording responses and looking up the scoring, but also allow
awkward gaps to occur which entirely spoil the favourable atmosphere
of alertness and interest. The reason for this is usually mere lack of
familiarity with instructions and scoring.

16. Impartiality.—While the good tester enters very fully into the
thoughts and feelings of the child he is testing, yet he also needs to
exert considerable self-control. There are various ways in which he
may (quite unwittingly) give illegitimate hints, for instance, by
over-emphasizing the crucial words in an oral test, or by making
empathic movements in the direction of the correct blocks, holes,
etc., of a performance test. Further, he cannot help but feel more
sympathetic with some children than with others. He should, therefore,
frequently ask himself: "Am I in any way being prejudiced by my
liking for this child; am I giving hints or crediting him with doubtful
passes because I believe that he is bright, which I would not do if I
believed that he was dull?"

17. Studying the Whole Child.—The tester should never regard his job
merely as measurement of the I.Q. or other test scores (although many
doctors, teachers, and others may seem to regard that as his sole
function). Rather he should try to understand, and to build up a
picture of, each child's whole personality. By doing so his test
results are more likely to be accurate, since he will realize when
they are being distorted by emotional or other factors; and he will
certainly be better able to interpret their significance when the
guidance or treatment of the child is under consideration. Although,
as we have seen, he is not justified in analysing a heterogeneous test
like the Binet scale into tests of hypothetical discrete abilities
(memory, verbal, practical, etc.), yet he can pick up innumerable
hints regarding the child's intellectual and emotional characteristics,
which can later be explored more thoroughly. Suggestions of specific
educational difficulties may readily be followed up. Buehler (1938)
shows that responses to the Ball and Field (or Purse and Field) test
may give valuable diagnostic indications. Other Binet items and, still
more, performance tests like the Porteus Mazes may be equally
revealing (cf. Vernon, 1937; Porteus, 1952). These deductions should,
[...] information, such as the psychiatrist's case-record, when
available. In this way a tester can always be extending his methods
and improving their validity.


18. Interpretation of Test Results: Unreliability.—No mental test
score should ever be accepted at its face value, nor trusted in the
same way as physical measurements are trusted. Even the best tests, it
should be remembered, only measure to within a certain Probable
Error. In other words, a test score should be regarded as the centre
of a range of possible scores. It may be correct to within, say, 5%,
but also it may be in error by as much as 20%, unless it has been
confirmed by the results of other tests which narrow the range of
uncertainty.
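
The notion of a score as the centre of a range can be illustrated with
the usual textbook formulas, which this section does not itself state:
standard error of measurement = SD x sqrt(1 - reliability), and Probable
Error = 0.6745 x standard error. The standard deviation, reliability and
observed score below are hypothetical, and the function names are
invented for the sketch.

```python
from math import sqrt

def standard_error(sd, reliability):
    """Standard error of measurement for a test of given SD and reliability."""
    return sd * sqrt(1.0 - reliability)

def probable_error(sd, reliability):
    """Probable Error: half of all errors are expected to be smaller than this."""
    return 0.6745 * standard_error(sd, reliability)

# Hypothetical test: SD 15, reliability 0.90, observed score 110.
sd, reliability, observed = 15.0, 0.90, 110.0
pe = probable_error(sd, reliability)
print(f"Probable Error: {pe:.1f}")
print(f"Even-chance range: {observed - pe:.1f} to {observed + pe:.1f}")
```

On this model there is only an even chance that the error lies inside
the range printed, so sizeable errors in individual cases remain quite
possible even with a reliability of 0.90.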

19. Units of Measurement.—The imperfections of mental units should
also be borne in mind, for example, the variations in size of M.A. and
E.A. units (cf. p. 70-1). The I.Q. is still more difficult for
teachers and laymen to interpret correctly. When derived from
M.A./C.A. it is primarily a measure of rate of mental growth. But
when it represents a standard score (as in Moray House, Wechsler,
and other tests), it is, like a percentile, a measure of brightness
relative to other children of the same chronological age. In neither
case is it an index of the amount of intelligence the child possesses
at that moment; similarly the E.Q. is not a measure of present
attainment. The M.A. or E.A., on the other hand, do provide such
indices. The point may be brought out by the following example of
the test scores of two children, A and B.

A    C.A. 9:0    M.A. 11:0    I.Q. 122
B    C.A. 8:0    M.A. 10:6    I.Q. 131

Here it is A, not B, who is the more intelligent, since he can do more
difficult intellectual tasks, and can be expected to be more advanced
in school work. It is true, however, that B is the brighter relative to
8-year-olds than A is relative to 9-year-olds; and as B is growing
more rapidly in mental powers, it is likely that he will catch up with
A in another four years, and (if the test is reliable and valid) that he
will, as an adult, be the more intelligent of the two.

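
The example of A and B can be worked through as a sketch. It assumes
that ages are given in years and months (so that 10:6 means 10 years 6
months), takes A's M.A. as 11:0, the value implied by C.A. 9:0 and
I.Q. 122, and makes the simplifying assumption that each child's ratio
of M.A. to C.A. stays constant over the next four years.

```python
def to_years(years, months):
    """Convert an age in years and months to decimal years."""
    return years + months / 12.0

def ratio_iq(ma, ca):
    """Ratio I.Q.: 100 * mental age / chronological age."""
    return 100.0 * ma / ca

children = {
    "A": {"ca": to_years(9, 0), "ma": to_years(11, 0)},
    "B": {"ca": to_years(8, 0), "ma": to_years(10, 6)},
}

projected = {}
for name, child in children.items():
    iq = ratio_iq(child["ma"], child["ca"])
    # Projected M.A. four years on, if the ratio M.A./C.A. stayed constant.
    projected[name] = iq / 100.0 * (child["ca"] + 4)
    print(name, "I.Q.", round(iq), "projected M.A.", round(projected[name], 2))
```

At present A leads by half a year of mental age; on the constant-ratio
assumption the lead shrinks to a small fraction of a year after four
years, which is the sense in which B is catching A up.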
In actual practical work of educational guidance or emotional
treatment, the tester need hardly bother to calculate the I.Q. or
E.Q., since what matters is the present level of the child's
intelligence and educational capacities. The only proper use of the
I.Q. is for predicting a testee's future career, i.e. for assessing
approximately his ultimate educational or vocational level.

20. Dependence of I.Q. on the Test Employed.—The naive tester or
layman usually assumes that a child would obtain the same I.Q.
whatever the test employed. This is so far from true that the test
employed should invariably be quoted along with an M.A., E.A.,
I.Q. or E.Q. The reasons for these discrepancies have already been
given, and will merely be summarized here.

(a) Chance errors of measurement, due to the fact that no test is
perfectly reliable (cf. pp. 125-9).

(b) Many tests are not well standardized. For instance, American
group tests may give results which are too lenient (pp. 66-8).

(c) Different intelligence tests have different dispersions of M.A.s
and I.Q.s, so that one test makes a bright child out to be much
brighter, a dull child to be much duller, than does another test
(pp. 71-3).

(d) The testees may have had more practice or coaching relevant
to one type of test than to the other (pp. 165-7).

(e) All tests are liable to distortion by unsuspected group factors,
or by emotional or other influences, about which we possess little
exact knowledge, and which we are not able to control (pp. 163-5).

In view of this uncertainty over the norms and the dispersion of
group intelligence tests, the inexperienced tester would be well
advised never to calculate individual I.Q.s from them,¹ but to treat
their results in purely relative fashion, i.e. in the way which was
recommended for educational tests (cf. p. 68-9). The class teacher can
obtain from a group test practically all the information which it is
capable of yielding if he or she simply finds which child in the class
has the highest score (corresponding to the highest M.A.), which the
next highest, and so on. When test results are thus confined to rank
orders, they are far less likely to be misinterpreted than is common
at present.
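
Reducing a set of group-test results to a rank order, as recommended
above, needs no norms at all. A minimal sketch follows; the pupils'
names and raw scores are hypothetical, and tied pupils are simply given
the same rank.

```python
# Hypothetical raw scores for a small class on one group test.
scores = {"Ann": 57, "Bill": 43, "Carol": 57, "David": 61, "Eve": 38}

# Sort from highest to lowest score.
ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranks = {}
for position, (name, score) in enumerate(ordered, start=1):
    # A pupil tied with an earlier one shares that pupil's rank.
    tied = [ranks[n] for n, s in ordered[:position - 1] if s == score]
    ranks[name] = tied[0] if tied else position

for name, _ in ordered:
    print(ranks[name], name, scores[name])
```

Nothing in the output claims more than the test can support: it says
who scored higher than whom in this class, and no more.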

¹ An exception may be made in the case of 11+ selection tests, which
almost invariably adopt a Standard Deviation of 15. Such a use of
quotients provides the simplest way of making age allowances, without
which far more of the older children in a year-group of candidates
than of the younger ones would gain places. At the same time,
egregious statistical errors, leading to gross unfairness, can be and
often are committed when Education Authorities try to run such
examinations without the advice of a statistically trained
psychologist.

21. Conclusion.—Mental tests have been widely, and often unfairly,
criticized. But the progress of testing is probably hindered to a
greater extent by its friends, who are ignorant of its limitations,
than by its enemies. One of the main objects of this book is to
analyse these limitations and to explain the underlying principles
which must be understood if tests are to be applied and interpreted
properly. No one ought to use tests who is not willing to familiarize
himself with these matters, and with the literature which bears on the
particular tests that he employs. In skilled hands, testing provides a
far surer and more accurate tool for the assessment of abilities than
do subjective impressions or the ordinary examination paper. But
human nature is far too complex to be measured in the simple and
direct ways in which physical quantities can be measured, and the
over-enthusiastic but unskilled tester is only too likely to make
serious mistakes.

Functions of Examinations.—The chief functions of ordinary written
examinations may be summarized as follows:

1. They are employed as tests of achievement, to prove that an
individual pupil has, or has not, acquired a certain amount of
knowledge of a subject.

2. They are given to a whole class, or school, for a similar purpose,
or for assessing the efficiency of the teachers whose pupils are
examined. Although ‘payment by results’ has disappeared, schools
are still commonly judged by the public on the basis of their
examination successes. And both head teachers and inspectors try to
ensure the maintenance of certain standards by means of periodic
oral or written examinations.

3. Many of the most important examinations are assumed to possess a
prognostic function, i.e. to be capable of predicting future
achievements. Thus they are used to bar the gates between the
primary and the secondary school, between the secondary school
and the university, between the university and many professions;
and they are supposed to select the few who are best fitted for these
higher stages.


4. Many qualities besides achievement, or promise of achievement, in
particular scholastic subjects, are believed to manifest themselves
in examination success. The business employers who insist that their
clerks shall have passed Matriculation or the General Certificate do
not actually desire employees who are good at Latin, biology, and the
like. When asked why they attach such importance to academic
examinations, their answers show that they hope, rather, to acquire
clerks of good general all-round ability in non-academic as well as in
academic subjects. They say also that success at an examination
requires a certain amount of perseverance and industry, some docility
to discipline, and calmness under pressure; and that these qualities
of character are desirable also in their employees. The significance
attached to a university degree is often almost as variegated.
Employers are probably quite as much to blame for the stranglehold of
the contemporary examination system as are the teaching profession
and the education authorities.

5. Another function of examinations whose existence is seldom
explicitly admitted is that they constitute an incentive for
stimulating pupils and students to work, particularly in the secondary
school and university; also for keeping teachers up to the mark.
Indeed, some would maintain that the prospect of having to do
examinations must still be employed as one of the chief motives in
education, even if their results are entirely worthless. Clearly, then,
examinations are an essential feature of the whole educational system
which relies largely upon extraneous motives, such as competition,
fear of punishment, and blame. And they can hardly be discarded until
the more progressive ideal of appealing to the pupils' or students'
spontaneous interests becomes a practical reality.

6. Many American writers (e.g. Kandel, 1936) believe that the 
only legitimate function of examinations should be that of guidance. 
By checking up the results of his instruction the teacher can see
where he has failed to make some topic clear, and can improve his 
methods. By examining a pupil’s accomplishments he can both 
diagnose the pupil’s weak points which require special attention, and 
can direct his endeavours along the lines for which he is most suited. 
Secondary schools and universities should apply examinations to 
entering candidates, not so as to keep out all but a select few, but to 
discover the fields of work in which each candidate is likely to do 
best, and should dissuade from entering only those who could
profit from no kind of advanced study.

Unfortunately, the traditional written examination fails very
badly in its attempts to fulfil these many different functions. We will
consider first the criticisms which have been levelled against it, and
later ask whether other measuring instruments would serve the
purpose better, or whether its adequacy can in any way be improved.

Criticism of the Effects of Examinations upon Education.—The
first and commonest complaint is that examinations dominate and
distort the whole curriculum. Although only about 5% of the
population reach the university, every grammar school pupil has to
study for the General (or the Scottish Leaving) Certificate as though
his ultimate objective was a university degree. The schools cannot
afford to educate in the broader sense, nor to give the majority of
pupils such knowledge as would be of most use to them in after
life, because they must work towards examinations concerned
almost exclusively with formal, academic subjects. Again, narrow
specialization is apt to occur in grammar and public schools, for the
sake of university scholarships.

Even more widespread are the bad effects of the 11+ secondary
school entrance examination. Subjects or activities which do not
contribute directly to passing the tests are generally neglected in the
last year or two of primary school. In the larger schools, children
are streamed even at the infant stage into classes thought likely to
pass or fail. Homework and coaching out of school hours are
introduced at too early an age; more often, be it noted, at the behest
of the parents than of the teachers.

Furthermore, these examinations stimulate an unhealthy competitive
spirit in children, and at all ages they encourage the cramming of
set books and rote memorization. Doubtless they act as an
incentive (Function No. 5) but not an incentive to learning for its own
sake. Indeed, examinations are an inefficient type of incentive, since
most pupils and students regard them as too remote to bother about
until a few weeks or days beforehand. A few candidates, however,
take them much too seriously, or are forced by their parents and
teachers to over-work; with the result that many of the maladjusted
children who come to Psychological Clinics for advice and treatment
show signs of this examination strain.

An additional excuse is often advanced for this state of affairs, one
which is based on the old fallacy of formal discipline. It is said that
although the passing of examinations in academic subjects is of no
direct value to the vast majority of the population, yet the pupils’
minds are trained thereby so that they become better able to learn
other subjects, to reason, to write good English, and so forth. A
large body of experimental evidence not only proves that this spread
of training in one field of study to other fields is usually extremely
limited in extent, but also suggests that the very opposite may occur.
The system may indeed train pupils and students in the best methods
of passing examinations in other subjects, but it is more likely to
stunt ‘reasoning power’ owing to its emphasis on mechanical
memorization. Often it produces a dislike of everything associated
with school, and so discourages the desire for further education. It
does not help to develop writing of good English, since the English
style in which most examination answers are written is known to be,
on the average, markedly inferior to the style in which the same
pupils can write compositions at leisure.

INADEQUATE RELIABILITY OF EXAMINATIONS

We will not, however, dwell on these points since, for the purposes
of this book, it is a more serious matter that most examinations are
very defective methods of measuring the achievements or the future
promise which they set out to measure. Their chief, though not their
only, flaw is inadequate reliability. The following facts are well
established:

(a) When examinees sit two different examinations dealing with
the same scholastic subject, both set and marked by the same
examiner, they are likely to obtain distinctly different marks on the
two occasions.

(b) When a single set of scripts is marked by two different
examiners, the resulting marks are liable to show wide discrepancies.

The main causes of poor reliability are:

(1) Actual changes in the examinees’ mental and physical states
which affect their examination performances.

(2) Inadequate sampling of the examinees’ knowledge and ability
by the particular questions set.

(3) Inconsistencies in the standards of marking adopted by 
different examiners, or by the same examiner on different occasions. 

(4) Differences of opinion between different examiners regarding 
the relative merit of the examinees' answers. 

1. Changes in the Examinees.—This was described above (p. 128) 
as ‘function fluctuation.’ Two closely parallel examinations sat on 
consecutive days will always give somewhat divergent results,
because some examinees will be feeling more fit, or in a better frame of
mind, at the first of the two, others at the second. Many external and
irrelevant happenings may temporarily depress or improve the work
of a few individuals. Over long intervals the examinees’ abilities will
naturally show considerable developments and alterations. On the
whole, however, the uncertainties occasioned by such factors are
probably smaller and less important than those due to the other
three causes.

2. Inadequate Sampling.—This factor is responsible for the common
complaint among examinees, namely, that the questions ‘did not suit
them,’ that on another set of questions they might do better (or
worse). A pupil’s capacities in the various special branches or
aspects of a subject are always somewhat uneven. And no examination
is capable of bringing out everything that the pupil can do in
that subject, or of measuring all his strong and weak points. Hence,
the half-dozen or so questions which he answers are never more than
a sample of the questions which would be needed to cover the whole
field. Common sense, as well as statistical principles, tell us that,
the more lengthy the examination, and the greater the number of
questions, the more representative will the sample be, and the better
will the ground be covered. Thus in certain university honours
examinations where the students’ marks are based on twenty-four
hours of examination work, their marks would be unlikely to alter
to any great extent if they sat another similar series of papers. But
when an examination consists, say, only of one short paper in
English and one in arithmetic, quite large alterations might easily
occur in further parallel papers. The correlation between results on
ordinary 1- or 2-hour papers in the same subject depends
considerably on the heterogeneity, or range of ability, among the
examinees (cf. p. 122); but a fairly typical figure is +0.70, and it is
often less than this. We can therefore predict, by applying the
Spearman-Brown formula (p. 127), that examinees would require to
take at least four such papers before the ground was sufficiently
covered for the reliability coefficient to rise to the fairly respectable
figure of +0.90.
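
The Spearman-Brown prediction quoted above can be checked with a few
lines of code. The following is a minimal sketch in Python, assuming
the standard form of the formula, r_n = n r / (1 + (n - 1) r), where r
is the correlation between two single papers and n is the number of
parallel papers combined. The function names are illustrative and do
not come from the text.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when n parallel papers, each correlating r
    with any other, are combined into one total mark."""
    return n * r / (1 + (n - 1) * r)


def papers_needed(r: float, target: float) -> int:
    """Smallest whole number of parallel papers whose combined
    reliability reaches the target figure."""
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("r and target must both lie strictly between 0 and 1")
    n = 1
    while spearman_brown(r, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Reliability of 1, 2, 3 and 4 papers when one paper correlates 0.70
    # with a parallel paper.
    for n in range(1, 5):
        print(n, round(spearman_brown(0.70, n), 3))
    print("papers needed for 0.90:", papers_needed(0.70, 0.90))
```

With r = 0.70, three papers give a predicted reliability of about
0.875 and four give about 0.903, which agrees with the figure of four
papers stated in the text.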

Suppose that one hundred pupils take such an examination, and 
that the passing mark is fixed so that half of them pass. Now if they 
take a second similar paper which correlates +0.69 with the first,
we can predict that only 37 of the pupils would pass on both, and
37 fail on both, and that as many as 26 would pass the first and fail
the second, or vice versa. This hypothetical but only too typical
illustration demonstrates the unfairness which inevitably results
from the lack of thoroughness of many examinations, and in
particular the difficulties which arise when the examination is supposed
to divide the candidates into two distinct categories at an arbitrary
pass mark.
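
The figures of 37, 37, and 26 can be reproduced on the assumption,
which the text does not state explicitly, that the marks on the two
papers follow a bivariate normal distribution. When each paper passes
exactly half the candidates, the proportion passing both is then
1/4 + arcsin(r) / (2π). The Python sketch below applies this relation;
the function name is illustrative.

```python
import math


def median_split_outcomes(r: float, candidates: int = 100):
    """Expected numbers passing both papers, failing both, and changing
    sides, when each paper passes exactly half the candidates and the
    two sets of marks are bivariate normal with correlation r."""
    same_side = 0.25 + math.asin(r) / (2 * math.pi)  # pass both (= fail both)
    pass_both = candidates * same_side
    fail_both = candidates * same_side
    changed = candidates - pass_both - fail_both
    return pass_both, fail_both, changed


if __name__ == "__main__":
    both, neither, changed = median_split_outcomes(0.69)
    print(round(both), round(neither), round(changed))  # prints: 37 37 26
```

A correlation of zero would give 25, 25, and 50, and a perfect
correlation 50, 50, and 0, so a coefficient of +0.69 leaves the
result a good deal nearer to chance than might be supposed.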

An additional reason for the variability of a pupil’s level in 
answering different questions or sets of questions is that the majority
of examination answers consist of English compositions or essays.
(This does not usually apply to papers in mathematics or in foreign
languages; but it does to English, history, geography, science, and
other subjects.) And the English essay is known to be a particularly
variable production. The compositions written by a pupil weekly
throughout a school year fluctuate widely in their merits. As Ballard
(1923) points out, teachers may refuse to accept this statement, and
claim that their pupils are fairly consistent from week to week. But
the fact that a teacher marks any one pupil’s work on a dead level
probably indicates that, unwittingly, he has formed a stereotyped
conception regarding his ability, and is marking the successive
compositions not on their own merits but in the light of his
knowledge of the pupil’s average achievement. The truth of this suggestion
is borne out by the many stories of the two pupils who regularly
received, one a good, the other a bad, mark, and who continued to
do so even when they exchanged their essays.

An outside examiner does not possess this knowledge of the
pupil’s average work, and so will give poor marks to those who
happen to write answers much below their normal level, good marks
to those who rise above it. Moreover, under the stress of examination
conditions, such deviations are likely to be exaggerated. Probably
much more representative products would be forthcoming if the
examination could be answered at leisure, under conditions similar
to those which apply when pupils do written work at home, or at
school. The common practice in Civil Service, and certain other,
examinations of using a single essay as an indication of the
candidates’ all-round literacy, cultural background, and ‘quality of mind’
is particularly deplorable, as pointed out by Vernon and Millican
(1954) in an investigation of students’ essays on several different
topics.

3. Divergent Scales of Marks.—This factor has already been
discussed fully in Chaps. II and IV, hence we will not pause to re-argue
the illogicality and injustice of a system which permits a far higher
proportion of passes, or a higher average mark, to be awarded in
one university or General Certificate subject than in another. If a
mark of 80%, or a first class or credit, signifies to one examiner the
standard achieved by a quarter of the examinees, and to another the
standard reached by only 1% of the examinees, then it is obvious
that this mark means different things to different people. And it
follows that an examinee’s mark will vary with the particular
examiner who happens to read his papers, unless all the marks given
by all the examiners are uniformly standardized in accordance with
the principles outlined above.

The standards of marking even of a single examiner are liable to
alter with fatigue and mood. When marking large numbers of answers
to the same question, he may begin very strictly and gradually
become more lenient, or vice versa. The same script might receive
a different mark if read after, instead of before, dinner.


4. Differences in Opinion as to Relative Merit.—The most important
source of unreliability (which is usually found in combination
with No. 3) is the ‘personal element’ which enters into every
examiner’s evaluation of a script. We will first summarize some of the
evidence regarding this subjectivity. The classical experiments were
carried out in America as far back as 1912. Starch and Elliott (1913)
sent copies of a single geometry paper to the chief geometry teachers
of 116 high schools, with the request that they should mark it in
accordance with their usual practice. The marks awarded varied all
the way from 28% to 92%. Two of the teachers gave it more than
90%, 18 gave it between 80% and 90%, 18 gave it between 60% and
30%, and 2 under 30%. Similar results were obtained with papers in
English and in history. Equally striking is the anecdote related by
Wood (1921) of the papers which were marked in the usual way by
six examiners. The first examiner, for his own guidance, wrote out a
set of (what he regarded as) model answers to the questions; but he
unfortunately left this among the candidates’ papers. He was
subsequently awarded marks varying from 40% to 90% by the other five
examiners. Many experiments, conducted under far more strictly
controlled conditions than these, have substantially confirmed their
findings. Particularly noteworthy are the elaborate studies carried
out in England and France in the 1930s, under the auspices of the
International Examinations Enquiry Committee.

Hartog and Rhodes (1935) arranged for typical groups of scripts
to be marked by experienced examiners under normal examination
conditions. Special place, School Certificate, and university honours
examinations were represented, and several different subjects were
studied. Generalizing from a number of their results, it appears that
when half a dozen examiners independently mark a set of scripts, the
lowest and highest marks given to any one script differ on the average
by about 10 to 12%. Among a few of the scripts the discrepancies
may be quite small, but among others they may be far larger, often
greater than 20%. The authors show that part of the divergence may
be ascribed to the different standards of leniency adopted by different
examiners (No. 3, above), but that even when this factor is
eliminated, large variations remain. In several of the experiments the
examiners drew up an agreed scheme of instructions for marking,
listing both the general characteristics and the detailed points which
were to be looked for in the scripts. In the marking of English
compositions, where this was scarcely possible, the inconsistencies were
much greater.

Now Hartog and Rhodes doubtless desired to shock the public
conscience with regard to examinations, and so did their best to
emphasize the disagreements. The present writer is more impressed
by the smallness than by the largeness of the discrepancies, and
would conclude from his study of the results that important public
examinations, like the School Certificate and university honours, are
marked much more consistently than is usually the case. Instead of
laying their chief emphasis on the extreme discrepancies, the authors
might have pointed out that the median disagreement between any two
examiners is not more than 3% in the best-conducted examinations.
Since this figure represents the Probable Error of an examinee’s mark,
a few examinees will naturally show discrepancies four or five times
as large. It represents an average correlation between different
examiners of +0.80, when the standard deviation of their marks is
10%. And decidedly lower figures have been obtained by other
investigators. The authors also omit to point out that the reliability of
an examinee’s final total mark is generally very much higher than
their results indicate, since it may be based on half a dozen or more
papers, each marked by two or more examiners. Their results mostly
refer to single three-hour, or two two-hour, papers.
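
The link between a Probable Error of 3%, a standard deviation of 10%,
and a correlation of about +0.80 can be verified numerically. The
Python sketch below assumes the conventional definitions, which are
not spelt out at this point in the text: the standard error of a mark
is taken as the standard deviation multiplied by the square root of
(1 - r), and the Probable Error as 0.6745 times that standard error
(the deviation exceeded by half the cases in a normal distribution).
The function names are illustrative.

```python
import math

# Half of a normal distribution lies within 0.6745 standard deviations
# of its mean, which is the basis of the Probable Error.
PE_FACTOR = 0.6745


def probable_error(sd: float, r: float) -> float:
    """Probable Error of a single mark, given the standard deviation of
    the marks and the correlation r between two examiners."""
    return PE_FACTOR * sd * math.sqrt(1.0 - r)


def implied_correlation(sd: float, pe: float) -> float:
    """Correlation between examiners implied by a given Probable Error
    and standard deviation (the same relation solved for r)."""
    return 1.0 - (pe / (PE_FACTOR * sd)) ** 2


if __name__ == "__main__":
    print(round(probable_error(10, 0.80), 2))
    print(round(implied_correlation(10, 3), 2))
```

Under these assumptions a correlation of +0.80 with a standard
deviation of 10% gives a Probable Error very close to 3%, and a
Probable Error of 3% implies a correlation very close to +0.80.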

Nevertheless, this study deserves to be taken very seriously, since 
it indicates that many less thorough examinations are deplorably
unreliable; that in the absence of a scheme of instructions drawn up
and applied by experienced examiners, much worse discrepancies
may arise; that when the average and the dispersion of marks are
not standardized, gross differences may appear in the proportions of
Credits, Passes, Fails, etc., which are awarded; and that even a P.E.
of 3% may frequently make all the difference between a Pass and a
Fail, or a First and a Second Class.

Another example will be quoted which better illustrates the
subjectivity of marking under ordinary school conditions. Boyd (1924)
collected large numbers of essays on ‘A Day at the Seaside’ from
11-year-old Scottish school-children (i.e. children at the qualifying
stage). From these he selected 26 as representing the whole range of
merit, and had them marked independently by 271 teachers. The
marks were not numerical, but consisted of seven grades, ranging
from Excellent to Unsatisfactory. On comparing the results of
different markers, it was found that almost every essay received at
least 6 of the 7 possible grades. The following instance is typical: 5
teachers marked it Excellent; 31 VG+; 80 VG; 123 G+; 28 G;
and 4 marked it Moderate (the lowest grade but one). Boyd found
that part of the discrepancy was due to differing standards. But even
when the 20% most lenient markers and the 20% most severe were
excluded, the average essay was still awarded at least 4 out of the 7
grades. He also points out that the average grade given to an essay
by a large number of markers is highly consistent, and that the
various divergent individual grades are distributed around this
average more or less in accordance with the normal distribution
curve. Actually it has been shown by Wiseman (1949) that a reasonably
reliable mark for English composition can be obtained, at the
11+ selection stage (since the range of ability is very much wider
than it is among highly selected grammar school pupils or university
students), provided that four examiners’ marks are combined. It is
desirable also that each candidate should write on two or more
topics. The burden of marking is not so heavy as might appear, since
only the essays of some 10 to 20% of borderline candidates need to
be considered; and rapid general impression marking at an average
rate of one minute per script is adequate.

... experiments were rendered quite futile by the selection of unduly
homogeneous sets of scripts for the examiners to mark.

Other experiments, including one of Hartog's, indicate that when
the same examiner re-marks a set of scripts after an interval, the
discrepancies between his first and second opinions are likely to be
almost as great as those between two independent examiners. Some
of the studies by the French members of the International
Examinations Enquiry are also worth quoting. A set of papers in science were
marked by three trained scientists (one of them doing it twice over),
and also by a student who had no scientific training. When the
markings were compared, the average correlation coefficient between the
scientists was +0.63, whilst the correlations between the student and
the scientists averaged +0.51. The latter figure is lower than the
former, but so little lower that it casts some suspicion on the value of
an examiner’s knowledge of his subject.

Particularly striking was a study of the philosophy examination in
the Baccalauréat. One hundred scripts were marked by six examiners.
Only 10 of the candidates were passed by all the examiners, only 9
failed by all of them. The remaining 71 were passed by some, failed
by others. It is, of course, always the examinees of medium ability
whose true standing is most uncertain. The few best or few worst are
recognized as such by all examiners. But only too frequently an
arbitrary dividing line, the pass mark, has to be erected in the
middle of this region of maximum uncertainty. Undoubtedly,
therefore, a large number of candidates who just succeed, or just fail, to
gain a General Certificate pass or a scholarship might very easily
obtain the opposite result if they were marked by a different examiner,
or if they sat an alternative set of papers. In many instances their
whole educational or vocational career may be blighted owing to
the unreliability of their marks.

Reasons for the Subjectivity of Marks.—The prime reason for the
discrepancies which we have described is that different markers
conceive differently the desirable or undesirable characteristics of
examination answers. In other words, they attempt to assess many diverse
types of abilities. One examiner may give chief credit for evidence of
work done, and for the comprehensiveness and accuracy of the facts
contained in the answers. Another may look rather for signs of
future promise, originality, and grasp of general principles. One may
insist on clarity of expression of ideas, another may try to assess the
profoundness of the ideas irrespective of their good or bad
expression. Some may search for emotional rather than strictly intellectual
qualities, such as the examinee’s interest in the subject. Whenever a
candidate states his attitudes or opinions, these are liable to coincide
with, or clash with, the attitudes or opinions of the examiner; and
however desirous the examiner may be of maintaining impartiality, he is
liable to bias if his pet theories are approved or attacked by the
candidates. Often, if he would take the trouble to formulate explicitly
his notion of the aim of the examination, he would find that he means
by a good examinee the one who closely approximates to himself, and as
a poor candidate the one who shows none of the intellectual or
emotional qualities which he himself idealizes. Is it to be wondered,
then, that different examiners differ in their opinions of a candidate?

We must recognize that any one script may manifest an extreme
complexity of features, good and bad, and so can be looked at from
a multitude of different points of view. Its merit can never be assessed
in the same way that a physical object is measured. An examiner
ordinarily splits it up into its chief components, notes the facts which
it contains or omits, its elements of originality, style of expression,
the mechanics of its grammar or computations, and so on. He
attaches a certain weight to each of these possible components (a
weight which is likely to be highly individualistic, unless a detailed
scheme of marking has been adopted), and finally re-synthesizes the
candidate’s performance on each component into a single mark for
the answer as a whole. Some examiners do not attempt a systematic
analysis of this kind, and merely mark on the basis of a ‘general
impression.’ But as Ballard (1923) and Hamilton (1929) show, every
general impression involves a more or less vague analysis.

Even in such subjects as arithmetic, or the translation of foreign
languages, the weightings assigned to the various analysable features
of an answer, and the penalties given to various possible mistakes,
are liable to differ among different markers. But the complexity is far
greater in the case of English composition, or in those subjects such
as history, science, etc., where the answers consist chiefly of
compositions. The essay-type of answer has been very widely, and
justifiably, criticized. Examinees have to waste a large proportion of the
available time in translating their historical or scientific knowledge
into readable essay-form, and in the mere process of writing. This
latter is particularly unfair, because some can write much faster than
others without necessarily being better historians or scientists. A
considerable proportion of the written words (for instance, the a’s,
the’s, and’s, etc.) add nothing to the evidence of the examinee’s good
or poor ability. Investigation shows that the average examinee puts
into writing less than one fact or idea per minute of examination
time, since the process of expressing these facts or ideas takes so
long. The examiner, then, has the still more difficult task of
translating the product back again, of trying to penetrate through the
verbiage to the signs of ability and knowledge beneath. Few
examiners can remain entirely uninfluenced by the literary style or the
handwriting of an answer (there is direct experimental proof of this),
although they are presumably trying to mark the product for its
historical or scientific merit, not qua essay.

The poor reliability of examinations greatly reduces their value for
any and every purpose. For if they cannot even measure examination
success without serious errors, they naturally cannot validly predict
anything else. Even, however, if perfect reliability were established,
validity would still be diminished by the dependence of examination
success on factors other than ability, knowledge, and industry.

One such factor is, obviously, the efficiency of the examinee’s
teachers as examination coaches, and their cleverness in guessing the
kinds of questions which examiners will set. It is manifestly unfair
that a pupil at one school, where the teachers are specially adept at
instilling examination knowledge, should succeed when a pupil of
equal ability fails because he attends another school, where the
teachers are either less skilled, or more humane in the education
which they provide.

There is also the old complaint that some pupils are ‘good
examinees,’ while others become unduly ‘nervous,’ and ‘fail to do
themselves justice.’ One wonders whether it is not usually the weaker
candidates who make this complaint, and whether they would do
better if subjected to any other kind of test, including the tests of real
life. However, we may admit that certain temperamental qualities,
which we can as yet hardly specify, together with good physical
health, may confer some advantage on certain candidates, which
other candidates lack.

Examinations, then, certainly fail to fulfil the first two functions 
with which we started, namely, the measurement of achievement,
both because of poor reliability and because of the other
considerations just mentioned. How imperfect they are we cannot determine.
We can, however, study their validity as measures of future ability
by directly correlating their marks with the results of the same pupils
at later stages in their careers. Comprehensive investigations into
this matter were carried out by Valentine and Emmett (1932), with
almost uniformly disheartening results. By comparing the subsequent
secondary school achievement of groups of pupils who had been
awarded, or who had failed to obtain, scholarships at the time of the
special place examination, it was found that many of the latter
ultimately surpassed the former group. Similarly, at the university
level, Valentine found that roughly two-fifths of all those awarded
scholarships failed to justify their promise, since they eventually
obtained only third-class honours or pass degrees. They were beaten
by as many as one-third of the non-scholars, who achieved first- or
second-class honours. Of those who won scholarships given by the
universities, only about one-fifth failed to live up to expectations, of
State scholars only one-tenth. But this meant that practically
one-half of those who obtained other types of scholarships (from their
schools or from local authorities) did not justify themselves (cf.
Valentine, 1938).

In a further study by the Scottish Council for Research in Education
(1936), marks gained by secondary school pupils at the Leaving
Certificate examination, together with teachers’ marks and the head
masters’ estimates of the pupils’ general standing, were correlated
with subsequent class marks and degree marks at the universities. It
was found that on the whole those who got first-class honours had
been rated higher by the school head masters than those who got
second class, and that honours students had been rated higher than
ordinary degree students; but that, as in Valentine’s research, there
were a great many exceptions. The actual correlation coefficients
between school and university marks varied from about +0.30 to
+0.70, and the average of several such results was +0.48. In the
light of our discussion of correlations (Chap. VIII), this figure
implies very poor predictive validity.

It should be remembered, however, that the above investigations
(together with most of the studies of reliability) dealt with highly
selected groups. At the 11+ selection stage practically the whole
range of children is examined, and more reliable tests than the
conventional subjectively marked papers are generally used nowadays.
The result is that this examination gives correlations around
+0.85 with performance in secondary school work 2 to 5 years later
(Emmett and Wilmut, 1952). Moreover, in view of the imperfect
reliability of the school marks used as a criterion of achievement, the
true figure should be of the order of +0.90. Yet even this coefficient
implies that many mistakes are made in selection. If 20 out of 100
are admitted to grammar schools, 5 (that is, one-quarter) are likely
to turn out unsuitable, and 5 who have been rejected and allocated to
modern schools would have surpassed them.
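
The figure of 5 in 20 can be reproduced under the usual assumption,
not stated explicitly in the text, that selection marks and later
school performance follow a bivariate normal distribution, and that
the 'suitable' children are the 20 in 100 who later rank highest. The
Python sketch below uses only the standard library; `joint_upper_tail`
and `selection_outcomes` are illustrative helpers, and the integral is
evaluated numerically by Simpson's rule.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_upper_tail(r: float, fraction: float, steps: int = 4000) -> float:
    """Proportion of a bivariate normal population (correlation r) that
    lies in the top `fraction` on BOTH variables.

    Integrates  pdf(x) * P(Y > z | X = x)  for x from z upwards by
    Simpson's rule, where z is the cut-off for the top `fraction`.
    """
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = STD_NORMAL.inv_cdf(1.0 - fraction)
    spread = math.sqrt(1.0 - r * r)      # SD of Y once X is known
    upper = z + 10.0                     # beyond this the density is negligible

    def integrand(x: float) -> float:
        return STD_NORMAL.pdf(x) * STD_NORMAL.cdf((r * x - z) / spread)

    h = (upper - z) / steps              # steps must be even for Simpson's rule
    total = integrand(z) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(z + i * h)
    return total * h / 3.0


def selection_outcomes(r: float, admitted: int, candidates: int = 100):
    """Expected numbers of admitted children who later rank in the top
    group (suitable) and who do not (unsuitable)."""
    suitable = candidates * joint_upper_tail(r, admitted / candidates)
    return suitable, admitted - suitable


if __name__ == "__main__":
    suitable, unsuitable = selection_outcomes(0.90, admitted=20)
    print(round(suitable), round(unsuitable))
```

With a correlation of +0.90 and 20 admitted out of 100, the sketch
gives roughly 15 suitable and 5 unsuitable entrants, in line with the
one-quarter quoted above. By symmetry, about 5 of the rejected children
would have been among the later top 20.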

It would be unreasonable to expect perfect correlations between
marks at successive stages in pupils’ careers, since the subjects
studied at a secondary school differ distinctly from those studied at
a primary or preparatory school, even when called by the same
names. And the conditions of work at a university differ from those
at a school, being more congenial to some, less stimulating to
others. Again, every individual changes to some extent as he grows
older, some of his abilities becoming relatively weaker, new ones
developing. All in all, then, the inability of entrance and scholarship
examinations to choose those candidates who will profit most from
higher education is not very surprising to the psychologist. Yet their
validity for this purpose is certainly much poorer than is generally
supposed by the educational authorities who set the examinations,
and who base their awards and selections upon them. The universities
would scarcely continue to dictate the present constitution of the
General and Leaving Certificates, and Matriculation examinations,
if they did not believe that these instruments picked out the most
suitable entrants.

Still more unjustified is the belief embodied in the fourth of the
functions outlined above (p. 196), namely that these examinations
are adequate criteria of vocational fitness. It is indeed true, as
employers assert, that successful candidates will on the average be
better in general all-round ability than those who have failed to gain
the Certificates, since all scholastic abilities are to some extent
positively correlated with intelligence. It is probably true, also, that
vocationally desirable temperamental and character qualities may
aid in examination success. But so do a great many other factors.
Some pupils with extremely undesirable characters may do well at
examinations through intellectual brilliance; others through efficient
coaching on the part of their teachers, or through over-cramming;
still others through the pure chance factors introduced by the
subjectivity of marking and other forms of unreliability. The
conclusion follows, then, that success or failure may depend on so
many diverse qualities, that it does not provide trustworthy evidence
of any of them. The employer would be far better advised to make
use of the services of a scientific vocational psychologist, who would
measure or obtain the best available assessments, both of intelligence
and other aptitudes, of educational achievements and of
temperamental traits, and would give each of these qualities its due weight
in deciding on the suitability of prospective employees. Undoubtedly
the adoption of a similar scientific study of each individual candidate
for secondary or university education would lead to many more just
selections, and more accurate predictions of future achievement, than
do the present methods (cf. Dale, 1954).

Although this chapter has consisted mainly of destructive criticism, 
certain hints have been given as to possible ways and means of 
improving examinations. These will be followed up, and expanded
in the next chapter and in Chap. XIV.

RELATIVE ADVANTAGES OF NEW-TYPE AND
ESSAY-TYPE EXAMINATIONS

It has been widely claimed that the worst defects of the traditional
type of written examination may be overcome by adopting in its
place the objective or new-type test for measuring achievement.
Such tests came into common use in the United States many years
ago, but are still little known in this country, in spite of Ballard's
(1923) able advocacy. Americans, however, realize now that they
also are liable to certain defects when used undiscriminatingly, and
that there are still functions which can better be performed by the
old- or essay-type of examination. Thus we are in a position to profit
both from their researches and from their errors, to devise our tests
in accordance with the principles they have already worked out, and
to apply them to the fields of educational measurement where they
will be most useful and most valid.

Now it is clear that some educational products can be marked
much more objectively and reliably than others. In the early stages
of school arithmetic, when children are given a series of short sums
to do, there is no subjective element involved in deciding which ones,
and how many, they do correctly. But before long arithmetic becomes
more complex; each sum involves a number of different
operations, partial credits are essential, and different markers are
now liable to differ. An English composition, or an examination
answer written in essay form, brings in a still greater personal
element, since it is still more complex and cannot ever be considered
merely from the point of view of right or wrong. The new-type
tester argues, therefore, let us refrain from setting sums, or essay
questions which involve many operations; and instead ask questions
which evoke each operation separately, questions to which only one
right answer is possible. And instead of leaving to the marker's
personal taste the analysis which must be made of a complex product,
and the weighting which must be attached to its component elements,
let the person who sets the questions make the analysis
beforehand, and decide systematically just how many marks (i.e.
how many questions) shall be allotted to each element.

Advantages of New-type Examinations. So we arrive at the new-type
examination, which contains a large number of brief questions
instead of a few long ones. Every question is so arranged that all
markers will agree as to the rightness or wrongness of an answer.
The elimination of the personal element is not the only advantage.
For the hundred or more questions of the new-type test are capable
of sampling the whole field of knowledge much more comprehensively
than can the half-dozen or so questions of the orthodox
examination. Candidates can no longer complain that the particular
questions failed to suit them, since all their strong and weak points
are probed. This fact should help to reduce the ‘spotting’ and cramming
of likely questions. Again, the subjectivity of standards of
marking need no longer cause trouble, since the examinees’ marks
are pure ‘count scores’ (cf. p. 20). So long as the questions are designed
to cover all levels of difficulty evenly, the distribution of marks
is sure to approximate to normality, and can readily be converted
into any desired percentage or other scale. The examiner may, of
course, draw an arbitrary pass line at any point in the distribution
that he wishes, so as to cut off the examinees who fail to come up to
his minimum standards.

A full description of the various sorts of questions which may be 
employed in new-type tests is given in the next chapter. Here it will 
be sufficient to distinguish between the so-called ‘open’ or ‘recall’ 
type of question and the ‘closed’ or ‘recognition’ type. In the former
the desired answer consists of a single word, or a few words, or
numbers, which the examinee must write either in a space provided
on the examination paper itself, or on a separate sheet. Questions
1-16 on pp. 213-14 are of this type. Since, however, there is always
some danger of partially right or alternative answers being given,
whose marking would involve subjective judgment, the other type of
question is more commonly adopted, where the examinee has to
identify the right answer from among two or more answers printed
on the examination paper. Either he may underline or put a check
beside the answer he believes to be correct, or else merely write
down its letter or number. Questions 17-54 on pp. 215-19 are of this
type.


This latter system has the additional advantage that practically all
writing is eliminated. So that when geography or history is being
examined, irrelevant factors such as speed of writing or literary style
do not distort the examiner's marking. It has been claimed that the
average examinee, instead of expressing less than one idea or fact
per minute (as in an essay-paper), can deal with three to six new-type
questions a minute. But this naturally depends on the level of
difficulty of the questions. A further advantage is that the examinee who
knows scarcely anything about the subject, but who bluffs by
‘free-associating’ around an essay-type question, and who often gains
marks by doing so, is entirely foiled. Older students realize how much
fairer is the new-type test, and on the whole prefer it to the orthodox
examination. Younger pupils, also, according to several American
investigations, enjoy it more.

A short test is appended below, dealing with some of
the previous chapters in this book, which the reader is advised to
try to answer. It is not meant to be a complete examination on
mental testing and statistics (hence it violates several of the rules of
construction given in the next chapter). But it introduces several of
the main varieties of new-type questions, some easy, some fairly
difficult, and should give the reader who is not already familiar with
such questions a better insight into them. Nos. 1-8 are simple-recall
type, Nos. 9-16 completion type, Nos. 17-25 true-false type, Nos.
26-36 multiple-choice type, Nos. 37-47 matching type, and Nos.
48-54 rearrangement type. The correct answers are listed on
p. 258.

EXAMINATION ON MENTAL MEASUREMENT AND STATISTICS

Time: One Hour

Please read the following directions carefully before starting:

Answer as many questions as possible, in any order. The answers
should be written in the right-hand margin on the blank spaces
provided. Any rough figuring may be done on blank paper.

The answer to every question (except Nos. 42-47) consists of one
letter, word, or number.

You may guess if you are not certain of an answer; but as marks
will be deducted for wrong answers, random guessing is not
advisable.


1. Eight pupils obtained the following school marks:

       A    B    C    D    E    F    G    H
       15   10   12   10   9    12   10   18

   What is their median? ........
2. If the above marks were put into rank order, what would
   B's rank position be? ........

3-5. The following is an extract from the norms for a group
   intelligence test:

       Score        48   58   68   77   84   89   93
       Mental Age   9    10   11   12   13   14   15 and adult

   What are the respective intelligence quotients of:
   3. A child aged 8.0 who scores 77? ........
   4. A child aged 10.0 who scores 63? ........
   5. An adult aged 20.0 who scores 77? ........

6-8. In a certain secondary school all the marks are transposed into
   the following standard scale. A pupil obtains marks of 60%,
   41%, and 63% respectively in English, French, and Mathematics.
   In his English class there are 41 pupils, in his French
   class 35, and in his Mathematics class 44.

       Percentile   Mark
       90           73%
       75           67%
       50           60%
       25           51%
       10           41%

   What is his approximate place (i.e. his rank-position in the
   class) in:
   6. English? ........
   7. French? ........
   8. Mathematics? ........


9-16. In each of the numbered spaces in the following paragraph,
   one word or letter is needed to complete the statements. Do not
   write anything in the actual spaces, but opposite No. 9 in the
   right-hand margin write the word or letter which is needed to
   fill up space (9). Fill in the other words or letters similarly.

   The compensation theory of mental organization is disproved by
   the fact that all correlations between mental abilities are (9).
   The results of correlational investigations also disprove the view
   that the mind is made up of distinct (10). The Two-factor Theory,
   proposed by (11), regards each ability as made up of a factor
   common to all abilities, called (12), and a factor (13) to that
   ability alone, called s. However, when verbal tests, performance
   tests, and other tests of intelligence are compared, it is doubtful
   whether (14) will account for all the inter-correlations. Residual
   correlations indicate the presence of (15) factors, such as verbal
   ability, v, and practical ability (16).

   9. ........    10. ........    11. ........    12. ........
   13. ........   14. ........    15. ........    16. ........


17-25. Write a + sign in the margin after any of the following
   statements which are true, and a − sign after those which are
   false.

   17. An omnibus test is a group intelligence test in which items
       of many kinds are mixed up ........
   18. A majority of very dull children are below average in health
       and physical development ........
   19. Performance tests are applied by psychologists chiefly in
       order to measure a child's practical ability in everyday
       life ........
   20. In order to measure the average intelligence of a class of
       7-year-old children, a teacher would be best advised to use
       a group test with pictorial material ........
   21. In a normal frequency distribution curve the semi-interquartile
       range and the standard deviation are identical ........
   22. University students who obtain scholarships on the basis of
       competitive examinations as often as not obtain no better
       degree than those who fail to win such scholarships ........
   23. The advantage of scoring abilities in terms of Mental Age or
       Educational Age is that the units of measurement are all
       equal to one another ........
   24. A group of a hundred children picked out as being especially
       superior in one school subject will practically never show
       as great an amount of average superiority in any other
       subject ........
   25. The Probable Error of a pupil's English composition mark is
       generally likely to be smaller than that of his arithmetic
       mark ........


26-36. Write the letter (A, B, or C, etc.) of the correct answer in the
   margin.

   26. Group intelligence tests, as contrasted with the
       Stanford-Binet test, are: ........
       (A) More varied in content
       (B) Better measures of children's intelligence
       (C) More objectively scored
       (D) More difficult to apply
       (E) Less suitable for application to adults


   27-32. A set of 51 pupils took three examinations, A, B, and C.
       The following distributions of marks were obtained:

           Mark    A     B     C
           87+     1     1     3
           80+     3     2     5
           73+     4     4     7
           66+     18    [illegible]
           59+     [illegible]
           52+     5     14    8
           45+     [illegible]  4
           38+     1     3     2
           [final row illegible]

   27. Which of the distributions approximates fairly closely to
       normal, A, B, C, or None? ........
   28. Which of the distributions is negatively skewed, A, B, C, or
       None? ........
   29. Which of the distributions is likely to have the smallest
       standard deviation? ........
   30. A certain pupil happened to get 70% on all three examinations.
       In which of them did he do relatively best? ........
   31. If you were calculating the average of Distribution C, what
       figure would you take as your arbitrary mean? ........
   32. In calculating this same average, what figure would be
       multiplied by 2 (i.e. by the number of pupils who score
       38+)? ........

   33. A group intelligence test was applied in a secondary school,
       and the occupational preferences of all the pupils were
       ascertained and classified under ten headings. In order to find
       whether there is any correspondence between occupational
       preference and intelligence, which of the following statistical
       techniques could best be applied? ........
       (A) Biserial r.
       (B) Product moment correlation.
       (C) Analysis of variance.
       (D) Contingency coefficient.
       (E) Correlation ratio.
       (F) Multiple correlation.

   34. Which is the least serious of the following defects in
       intelligence quotients obtained from group intelligence
       tests? ........
       (A) Group test scores are considerably affected by previous
           practice on similar tests.
       (B) The standard deviation of group test I.Q.s differs
           considerably in different tests.
       (C) Few of the published tests are adequately standardized,
           or have reliable norms.
       (D) Group tests are generally given with a time limit, and
           the child who is quickest is not necessarily the most
           intelligent.


   35. Which one of the following statements is untrue? The
       distribution of marks given by an examiner to a class of fifty
       pupils may be very irregular or skewed because: ........
       (A) The number of cases is too small for a smooth, normal
           distribution to be expected.
       (B) The examiner's opinion of the relative merits of the
           scripts is a personal one.
       (C) The pupils’ abilities may be asymmetrically distributed,
           as when, for example, the class contains a long tail.
       (D) The scale of marks which the examiner adopts may be
           lacking in any rational basis.


   36. In an experiment on the scores of graduate and non-graduate
       teachers on verbal and non-verbal group intelligence tests,
       these results were obtained:

                                            Verbal Tests   Non-verbal Tests
           Graduates                        153.9          78.6
           Non-graduates                    143.5          77.0
           P.E. of the difference
             between means                  2.0            1.2

       Which of the following conclusions may legitimately be
       drawn? ........
       (A) Graduates do better than non-graduates at intelligence
           tests.
       (B) All teachers do better at verbal than at non-verbal tests.
       (C) Scores on verbal tests are improved by taking a university
           degree.
       (D) Teachers with university training have no appreciable
           advantage over others on non-verbal tests.


2 
[Figure: four pairs of frequency curves, labelled (A) to (D), each plotted against I.Q.]

37-41. The above pairs of curves (A-D) represent the approximate
   frequency distributions of the intelligence quotients of various
   groups of children. Below are listed several such groups.
   Opposite four of them write the letter of the pair of curves
   which corresponds. Write ‘None’ opposite the groups to which
   no pair of curves corresponds.


   37. One thousand elementary school pupils, and one thousand
       special school pupils ........
   38. All elementary school pupils and all special school pupils
       in the country ........
   39. All elementary school pupils and all children in the
       country ........
   40. All special school and all [illegible] school pupils in the
       country ........
   41. All elementary school and all secondary school pupils in the
       country ........


42-47. Below is given a list of psychologists (A-F), together with a
   list of important pieces of work in educational psychology.
   After each of the latter write the letter of the psychologist who
   was mainly responsible for, or whom you chiefly associate with
   this piece of work. Two letters may be written if two psychologists
   were prominently associated. Some letters may be used
   twice or more, some not at all.

   (A) Ballard
   (B) Binet
   (C) Burt
   (D) Drever-Collins
   (E) Terman
   (F) Thurstone

   42. Development of the conception of the Intelligence
       Quotient ........
   43. Development of a scale of performance tests ........
   44. Investigations by correlations into the factors underlying
       abilities ........
   45. Advocacy of new-type examinations ........
   46. Construction of educational attainments tests ........
   47. Re-standardization of the Binet-Simon scale ........


48-54. An arithmetic test, a handwriting test, and two group
   intelligence tests were given to an ordinary school class, and
   the following correlations between test scores were worked out:

   (A) Combined intelligence tests with handwriting test.
   (B) Combined intelligence tests with arithmetic test.
   (C) Combined intelligence tests with chronological age.
   (D) One intelligence test with the other.
   (E) Arithmetic test with handwriting test.
   (F) Handwriting test with chronological age.
   (G) One intelligence test with arithmetic test.

   Which pair would you expect to yield:
   48. The highest correlation? ........
   49. The second highest? ........
   50. The third? ........
   51. The fourth? ........
   52. The fifth? ........
   53. The sixth? ........
   54. The lowest correlation? ........

DEFECTS OF NEW-TYPE EXAMINATIONS

So far only the advantages of new-type examinations have been
mentioned, and we must now consider criticisms of them in detail.
Two minor objections will be admitted at once. First, the cost of
printing or duplicating a new-type test is considerably greater than
that of an essay-type examination. For classroom purposes, however,
oral presentation of the questions can often be adopted. If several
groups of pupils or students are to be examined, they can write their
answers on blank sheets, and their test papers can be used several
times over.

Secondly, much greater time and skill are needed for constructing
a good new-type test. The major criticisms which are to be discussed
below apply only too often to inferior tests which have been rapidly
constructed by inexperienced amateurs. It is highly desirable that
teachers should be trained in the principles of testing during their
training college courses. The time needed in setting a paper is,
however, more than offset by the time saved in marking, when the
number of examinees is large. Indeed, since the marking should be
foolproof, there is no reason why it should not be done by clerks or
secretaries. In large public examinations, the authorities could save
much expense by paying professional examiners only to set, not to
correct, the papers. The present writer finds that the construction of
a one-hour new-type test in educational psychology (such as that
appended above) takes him about twenty hours. Setting an essay
examination scarcely needs one hour. But in marking these one-hour
examinations he can easily do thirty of the former as against
ten or less of the latter scripts per hour. He, therefore, finds no saving
in total hours of work unless the number of examinees approximates
to 300 or more. But the setting of a new-type test is a fascinating
occupation, which can be done in odd moments throughout the
year; and the marking is simply a routine matter which involves no
mental strain. By contrast, the marking of large numbers of essay-type
scripts in psychology is the most tiring work that he ever has to do.
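
The break-even point implied by these figures can be worked out directly. The following is a minimal sketch using only the rates quoted in the paragraph above (twenty hours against one hour for setting, thirty against ten scripts per hour for marking); the function names are hypothetical.

```python
from fractions import Fraction

def total_hours(n_examinees, setting_hours, scripts_per_hour):
    """Setting time plus marking time for n_examinees scripts."""
    return setting_hours + Fraction(n_examinees, scripts_per_hour)

def break_even(new_setting=20, new_rate=30, essay_setting=1, essay_rate=10):
    """Number of examinees at which both kinds of paper cost the same
    total time: new_setting + n/new_rate == essay_setting + n/essay_rate."""
    extra_setting = Fraction(new_setting - essay_setting)
    saving_per_script = Fraction(1, essay_rate) - Fraction(1, new_rate)
    return extra_setting / saving_per_script

if __name__ == "__main__":
    n = break_even()
    print(n, total_hours(n, 20, 30), total_hours(n, 1, 10))
```

With these rates the two totals meet at 285 examinees, which is consistent with the rough figure of 300 given in the text; slower essay marking would bring the break-even point lower.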
The Guessing Factor. Next we must admit that examinees may
obtain a certain number of correct answers (in a recognition-type
test) by pure chance guessing. If they filled in answers at random
they would, on the average, achieve half-marks on the questions
where two responses are provided (such as the true-false type), and
quarter-marks on the questions where four responses are provided.
This fact disturbs the teacher far more than it does the psychologist.
For the latter realizes that every examinee has the same chance of
scoring, let us say, 33%, by random guessing; he therefore regards
33% as the effective zero point and only gives credit to scores above
this level. An alternative procedure, which comes to much the same
thing, but is fairer in that it does not penalize the examinee who omits
questions rather than guess them, is to subtract the total wrong
responses in true-false questions from the total right ones; to subtract
half the total wrong in three-response questions, one-third in
four-response questions, and so on. That is to say, an examinee's
score corrected for guessing $= R - \frac{W}{n-1}$, where $R$ is the total right,
$W$ the total wrong, and $n$ is the number of alternative responses
provided in each question. In general practice the correction is
omitted in four- or more response questions, since it is so small.¹
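
As an illustration of the correction just described, here is a minimal sketch in Python. The function name and its arguments are hypothetical; omitted questions simply do not enter into either total.

```python
def corrected_score(right, wrong, n_alternatives):
    """Score corrected for guessing: R - W / (n - 1).

    right          -- total right answers (R)
    wrong          -- total wrong answers (W); omissions are not counted
    n_alternatives -- responses offered per question (n), at least 2
    """
    if n_alternatives < 2:
        raise ValueError("a recognition item needs at least two responses")
    return right - wrong / (n_alternatives - 1)

if __name__ == "__main__":
    # True-false (n = 2): each wrong answer cancels one right answer.
    print(corrected_score(6, 4, 2))
    # Three-response items: half the wrong answers are subtracted.
    print(corrected_score(30, 10, 3))
    # Four-response items: one-third of the wrong answers are subtracted.
    print(corrected_score(30, 9, 4))
```

An examinee who omits four true-false items and gets six right keeps a score of 6, whereas one who guesses those four wrongly drops to 2, which is the sense in which the correction does not penalize omission.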
But the problem of guessing is a complex one which cannot be
entirely settled by this mechanical correction. It has aroused
widespread controversy, and a long discussion of it may be found in Ruch
(1929). Although the above correction will compensate for the
effects of guessing in the average examinee, yet it will not do so in
all cases, since some will be luckier than others, in accordance with
the principles of probability. Suppose that a large group of persons
guesses the responses to ten true-false items; the average person will
get 5 of them right, and two-thirds of the group will get either 4, 5,
or 6 right.* Only about one person in a thousand will be so lucky as
to get 10 right, or so unlucky as to hit on none of the right answers.
If the guessing correction is applied, the scores corresponding to 10,
6, 5, 4, and 0 right will be +10, +2, 0, −2, and −10 respectively.
In a longer test the range of chance scores which may arise from
guessing will be still greater, although their average (when corrected)
will still be zero. Undoubtedly, then, we must expect true-false tests
to be somewhat unreliable, but the effect will be less marked in tests
composed of three- or four-response items. Experimental evidence
does indeed show true-false tests to be less reliable than multiple-choice
tests; and the latter are less reliable than tests made up of
recall items. But the difference between them is not large, and it may
easily be offset by the fact that a true-false test can contain more items
than a multiple-choice test which occupies the same length of time,
since true-false items can be answered more quickly. Probably a
good deal depends, also, on the skill of the tester in setting each
particular type of question.
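
The chance figures quoted for ten true-false items follow from the binomial distribution. The sketch below assumes purely random and independent guesses on every item; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def chance_distribution(n_items, n_alternatives=2):
    """Probability of each number right when every item is guessed at
    random and independently (binomial distribution)."""
    p = Fraction(1, n_alternatives)
    return {k: comb(n_items, k) * p**k * (1 - p)**(n_items - k)
            for k in range(n_items + 1)}

dist = chance_distribution(10)
middle = dist[4] + dist[5] + dist[6]           # 4, 5, or 6 right
mean_right = sum(k * pr for k, pr in dist.items())

if __name__ == "__main__":
    print(float(middle))     # proportion getting 4, 5, or 6 right
    print(dist[10])          # chance of all ten right
    print(mean_right)        # average number right
    # Corrected scores (right minus wrong) for 10, 6, 5, 4, and 0 right.
    print([r - (10 - r) for r in (10, 6, 5, 4, 0)])
```

The proportion scoring 4, 5, or 6 comes to 672/1024, or about two-thirds, and the chance of ten right is 1 in 1024, in agreement with the text.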
Now, though we must allow that a considerable error may occur
among the scores of a small proportion of examinees on account of


¹ Note that there is no point in making any guessing correction when all
examinees attempt an answer to every question. Their relative marks will be
affected by the correction only when they omit some questions. In general, then,
correction becomes less necessary the longer the time allowance for the
examination.

* The situation is effectively the same as getting the persons each to toss a
coin ten times, and counting up how many of them obtain 0, 1, 2, ... 9, 10
heads.




guessing, in actual practice entirely random guessing (which was
assumed in the previous paragraph) will very seldom occur. The
examinee is more likely to possess an incomplete knowledge of
certain items, sufficient, however, to make his guesses right ones.
Teachers may regard it as unfair that such an examinee will obtain
the same marks on those items as will the more able examinee who
really knows the right answers. But a brief study of recognition-type
questions, such as those given above, will show that the examinee's
shaky knowledge may not always be to his advantage. For the
alternative (wrong) answers are usually worded sufficiently plausibly
to deceive the weak student, and should indeed be based on common
misconceptions. Hence his ‘semi-guesses’ will in practice be very
frequently wrong. For this reason experimental researches indicate
that the guessing correction fully compensates, or even over-compensates,
for any advantages which the weak student might gain
by guessing. Other investigations prove that the amount of guessing
may be greatly reduced, and the reliability and validity of the test
slightly increased, by instructing examinees not to guess, and by
warning them that wrong guesses will be penalized.

To sum up: random guessing may considerably distort the marks
on a recognition-type test, especially when the questions provide
only two alternative answers. It does not actually do so to any great
extent, partly because guessing scarcely ever is random, partly because
it can be reduced by appropriate instructions, and partly because
such tests can include large numbers of items which make for increased
reliability. If the recognition-type of item is to be condemned, it must
be primarily on account of some of the other reasons discussed below.

Suggestive Effects of Wrong Responses. Another common
criticism which we can refute is that the presentation of false statements
and wrong answers to the suggestible minds of pupils is
pedagogically unsound. It is alleged that they may absorb such statements
and later on believe them to be correct. This theory is an a
priori one, and experiments which have been devised to try it out
have always failed to provide confirmation. Probably the fact that
examinees know that about half the true-false statements, and more
of the multiple-choice responses, are wrong is sufficient to offset the
effects of suggestion. Many teachers find that there is considerable
pedagogical value in getting pupils and students to correct their own
papers shortly after the test, so letting them see which answers are
wrong (cf. Ballard, 1923).

The Elementaristic Tendency in New-type Tests. We come now to
the most obvious and most frequent objection to new-type tests. It
is said that they measure only the most trivial aspects of ability, such
as details of information. The tests may indeed analyse complex
mental operations into elements which can be objectively marked,
but they cannot put the elements together again, and in the process
of analysis they omit many of the most important aspects of ability.
They cannot, as can the essay-type of examination, show the
examinee's general understanding of the subject, nor his interpretation
of facts, his capacity for organizing and formulating his knowledge,
nor his initiative and originality. Far from reducing the tendency of
examinees to cram mechanically, they may increase the amount of
rote memorization and discourage any desire to achieve a general
grasp of the field. Investigation has, indeed, shown that students revise
their work in a different manner when they know they are to sit a
new-type examination, and that they concentrate on acquiring such
bits of knowledge as are most likely to appear in the test questions.
It is often alleged, though without proof, that secondary pupils who
have been trained in the primary schools to pass objective selection
tests find considerable difficulties in adjusting to more normal
forms of school learning.

Such criticisms embody a number of half-truths, and may be
countered by a number of arguments. The retort may first be made
that even if the new-type test cannot measure understanding,
originality, etc., the essay examination cannot do so either. For
when examiners come to assess such qualities, they disagree widely
with one another. The orthodox examination achieves a reasonable
degree of reliability when it most closely approximates to a new-type
test, i.e. when it asks primarily for factual data, or when the
examiners draw up a detailed analytic scheme of the points which they
wish to mark. Secondly, it may be pointed out that examinees
engaged on an essay examination do not usually shine in regard to
organization of knowledge and original thinking. Rather they dash
down all the ideas and facts which they can recall, many of them
irrelevant. ‘Thinking’ as contrasted with ‘reproduction of items of
information’ is at a minimum. The new-type test taps all the relevant
knowledge, with much less waste of time, and prevents the candidates
from producing irrelevant material.

Such recriminations are also, however, only partially justified.
Much sounder is the argument which points to weaknesses in the
critics’ psychological analysis of the contrasted examinations. They
commonly draw a sharp distinction between information and
reproduction, on the one hand, and understanding and thinking, on the
other hand. They assume, that is, that these mental processes are
uncorrelated. In actual practice the two cannot be separated so
readily. The acquisition and reproduction of information always
involves a certain amount of thinking, and no thinking is possible
unless a person possesses information to think about. Hence,
experiments in which attempts have been made to measure these
faculties separately have usually shown them to be highly
inter-correlated. It follows, therefore, that even when a new-type test is
apparently measuring nothing but information, it is at the same time
providing a pretty good measure of other more complex types of
ability. It cannot ask examinees to ‘Discuss . . ., Compare . . ., etc.’
nor make any overt provision for independent thinking, yet it is
certainly drawing to a considerable extent upon these ‘higher’ mental
processes which the orthodox examiner desires to assess. Conversely,
a large proportion of the average essay-type answer consists of the
sort of material which critics affect to despise when it is accurately
measured by new-type tests.

The overlapping between the two kinds of mental processes is
further demonstrated by the fact that a new-type response, or a
sentence in an essay answer, may depend upon ‘understanding’ in one
examinee, but may result from ‘facile reproduction’ in another
examinee, according to the respective methods by which these
examinees have studied the topic in question. Nevertheless, it seems
legitimate to make a partial distinction between the two types, and
to admit that examinees have more opportunity of displaying ‘higher’
mental processes in essay-type than in most new-type examinations.
And we must remember that, even if they do not usually write good
English in their essay papers, yet they would probably never learn
to write English at all if new-type examinations became the sole
method of testing their abilities. This does indeed tend to occur
among primary pupils in education areas where the selection
examinations are purely objective, and some authorities have
reintroduced the composition examination in order to counter it.

The reason why the criticisms of triviality and fragmentariness
seem at first sight so justifiable and so serious is that many
new-type tests are incompetently constructed. Questions which ask for
bits of information are much the easiest to set, and, therefore, do
only too frequently constitute the bulk of home-made examinations.
But there is nothing inherent in new-type examining which makes
better questions unattainable. And the writer would claim that at
least half the questions in his own test, above, do demand not only
the possession of information, but also the application of principles
or the interpretation of information; that is to say, understanding.

To conclude, then, the defect which we have been discussing is one
which may occur in new-type tests, but need not do so to any great
extent. Its effect upon the validity of the tests has been greatly
exaggerated by critics owing to their faulty views of the nature and
organization of mental abilities.

Defects Due to the New-type Medium of Expression.—Our next
objection is of a different order. It does not try to point out
something which new-type tests fail to measure, but rather to show that
they measure a good deal over and above the abilities which we
actually want to assess; and that this extra factor, having little
educational significance, seriously distorts and reduces the validity
of all new-type test results (cf. pp. 144, 168). Some of the most ardent
advocates of new-type testing, e.g. Ballard (1923), do not seem to
have taken this criticism into account. An excellent account of it may
be found in Hawkes and Lindquist et al. (1937).

If the reader will analyse carefully the processes which take place
in his own mind when answering the new-type questions above, he is
likely to find a good deal going on besides the recollection and
application of his knowledge of the subject. He must first study the
instructions to discover what he has to do in each question, and then
manipulate his knowledge in such a way as to fit it in with the
requirements of the question. In a multiple-choice question he may
arrive at the right response, not because he is certain that he knows
it, but because he has eliminated the incorrect ones. Sometimes a single
word in the question or in the provided response will enable him to
choose correctly, without his having to think out the full implications
of all that is printed. Some questions may, unknown to the author,
contain clues to the correct responses which do not require any
knowledge whatsoever of the subject. But if the reader is a sufficiently
sophisticated new-type examinee, he will at once take advantage of
such faults in construction. Examples of these irrelevant clues are
given in the next chapter. Unfortunately, the questions which are
most liable to contain such errors are those that aim to evoke
‘understanding’ rather than ‘reproduction.’ Straightforward questions
which ask primarily for factual information (e.g. Nos. 27-33
and 42-47) are generally free from them. But more elaborate
questions (e.g. Nos. 34-41, 48-54) involve a great deal of ‘disentangling’
and complex inference. The ability to do such questions may depend
rather largely on the examinee’s previous experience and practice
with such tests, or upon some unknown group factor, and not
solely on his knowledge of the subject being tested.

Such sources of unreliability and invalidity may be much reduced
if the examiner is sufficiently skilled in anticipating all the diverse
mental processes which examinees may employ, but they are likely
to be eliminated only if he falls back on questions of the
simple-recall type, and asks merely for scraps of information. Actually no
examiner can foresee all the possible weaknesses in his questions, for
investigation shows that even tests constructed by experts may
contain items whose validity is very low, or sometimes negative. That is,
these items are done no better, and sometimes worse, by good than
by poor students. A difficult multiple-choice item, for example, may
contain wrong alternatives which are sufficiently plausible to deceive
the majority of good students, but which are beyond the
comprehension of the poor students, who consequently just guess, and often
guess rightly. Or the correct answer may contain some unsuspected
ambiguity which is not noticed by mediocre students, who therefore
choose it, but which is noticed by good students, who therefore reject
it. Such faults in construction are still more likely to occur in tests
devised by amateurs.
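
The notion of an item with low or negative validity can be made concrete with a small calculation. The sketch below is only an illustration, not the author's own procedure: it compares the proportion passing an item among the highest-scoring and lowest-scoring examinees. The function name, the sample data, and the default of taking the top and bottom 27 per cent are assumptions introduced here.

```python
from statistics import mean


def item_discrimination(item_correct, total_scores, fraction=0.27):
    """Upper-lower discrimination index for one item.

    item_correct: 1 or 0 for each examinee (item right or wrong).
    total_scores: each examinee's total mark on the whole test.
    fraction: share of examinees placed in each extreme group
              (0.27 is a conventional choice, assumed here).
    A value near zero or below zero flags an item that good students
    do no better on, or do worse on, than poor students.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Rank examinees from lowest to highest total mark.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group, high_group = order[:k], order[-k:]
    p_high = mean(item_correct[i] for i in high_group)
    p_low = mean(item_correct[i] for i in low_group)
    return p_high - p_low


# Hypothetical data: ten examinees, totals from 10 to 100.
totals = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ambiguous_item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]  # chosen mostly by weaker examinees
print(item_discrimination(ambiguous_item, totals))
```

An item of the ambiguous kind described above would show a negative index and would be a candidate for rejection or rewording.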

It was shown in the previous chapter that many of the defects of
the orthodox examination derived from the fact that examinees have
to formulate or express their knowledge and ability in a distorting
medium, namely essays, and that examiners disagree in their
interpretation and evaluation of these complex mental products. We see
now that much the same is true of new-type examinations. Examinees
have to express their knowledge and ability in a different but perhaps
still more artificial medium. This ‘roundaboutness,’ and the
complexity of the mental processes involved in arriving at their answers,
undoubtedly reduce the validity of their marks. In examining average
and dull pupils or adults it is not correct to claim that the new-type
examination is unaffected by their poor literacy. For though writing
is largely eliminated, the examinee has far more reading to do.

The Subjectivity of New-type Tests.—A final defect in new-type
tests, which has been entirely neglected by the majority of writers, is
that they are, in a sense, just as subjective as essay-type examinations.
Only the personal element enters, not in the marking, but in the
setting of the questions. Pullias (1937) has demonstrated that when
two or more examiners construct their own new-type test papers,
intending to cover precisely the same field of knowledge and ability,
and apply them to the same pupils or students, the average
correlation between the two sets of marks is only of the order of +0.50.
This figure may be unduly low because the examiners were teachers
who may not have been highly skilled in test construction. Yet some
thirty-five different teachers took part in the experiment, all of whom
habitually employed new-type tests, and the results were very similar
in different school subjects, and among pupils of various ages. Pullias
further found that standardized educational tests, which were alleged
to measure ability in the same school subjects, only yielded an
average inter-correlation of +0.68. In some instances the
coefficients were decidedly higher (+0.8 to +0.9), in others very much
lower (+0.2 to +0.3).

How may we account for such results? Actually the sources of
unreliability described on pp. 200-7 apply here, too. These were:

1. Actual changes in the examinees, or function fluctuation.

2. Inadequate sampling. As already mentioned (p. 212), this
should be much less prominent in new- than in essay-type
examinations, especially if the instructions given at the beginning of the next
chapter are followed. The trouble is not so much lack of
thoroughness of the questions as the subjectivity of their selection.

3. Statistical defects. These can arise, not because different
examiners adopt different scales of marks, but because the average level,
and the range, of difficulty among their questions may differ. Such
inconsistencies can, however, readily be eliminated, and they did not
affect Pullias’s correlations.

4. Subjectivity. The essay-examination is subjective because the
examiner has to analyse the field of ability and decide which aspects
are most important, and his personal opinion determines which
characteristics of the examinees shall be assessed as good or bad. The
new-type examination is subjective because the examiner has to
perform a similar process of analysis in choosing questions, and in
providing the good and bad answers between which the examinees are
to discriminate. Again, the essay-examiner is unable to maintain
objectivity because he must translate the answers presented to him,
or penetrate through the medium of expression, in order to discover
the examinees’ real merit. The new-type examiner carries out an
equivalent translation when formulating his questions. Hence
different examiners will formulate questions about a given topic
differently, in just the same way that different essay-examiners will
translate differently the answers to an essay-question on this topic.

Pullias points out that the main stages in examining, namely (a)
constructing questions, (b) answering questions, and (c) evaluating
answers, in reality constitute a single whole. The new-type examiner
has, indeed, eliminated subjectivity from (c), but at the cost of greater
subjectivity in (a). And his personal opinion has inserted itself into (b),
since his questions embody a good deal of what, in essay-examinations,
is normally carried out by the examinees.
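
The agreement between two examiners' papers discussed above is expressed as a product-moment correlation between the two sets of marks. The following is a minimal sketch of that calculation; the marks are invented for illustration and are not Pullias's data. It also shows why a difference in average level or range of difficulty need not disturb such a figure: the coefficient is unchanged when one set of marks is rescaled linearly.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of marks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical marks of the same eight pupils on two teachers' papers.
paper_a = [34, 51, 47, 62, 58, 40, 73, 66]
paper_b = [45, 48, 60, 55, 70, 38, 64, 72]
r = pearson_r(paper_a, paper_b)

# A harder paper with a narrower spread of marks (a linear rescaling
# of paper_b) yields the same coefficient.
paper_b_harder = [0.5 * mark - 10 for mark in paper_b]
r_rescaled = pearson_r(paper_a, paper_b_harder)
print(round(r, 3), round(r_rescaled, 3))
```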

Now that we recognize this subjectivity as the chief flaw in
new-type tests, we should be able to control and reduce it more readily
than in essay-examining. Nevertheless, we must certainly conclude
that both types of examination are imperfect mental measuring
instruments, and open to much the same objections. Conclusive
evidence as to their relative validity is not easy to obtain, since usually
it is only possible to compare their marks with the marks on
subsequent, equally fallible, examinations. But by pooling the marks
from several papers of both types, from class teachers’ estimates, etc.,
a fairly good criterion may be achieved. And according to this,
neither the essay-type nor the new-type is definitely superior (cf.
Lee and Symonds, 1933). The evidence of the superior validity
of new-type tests which Ballard (1923) puts forward is quite
unconvincing. In secondary school selection conventional, subjectively
marked examinations are certainly inferior to objective tests. But
when the marking is standardized there is little to choose; sometimes
the old type, sometimes the new, show better correlations with later
secondary school achievement. Certainly there is no proof that the
application of new-type tests in university entrance or scholarship
examinations would give appreciably better, or poorer, predictions of
ultimate university success than at present. Since, however, the errors
which reduce the validities of the two are different, it follows that
they measure somewhat different aspects of ability. Hence by
combining them the respective errors partially cancel one another out,
and the total validity is definitely superior to that of either of the
separate types. And as certain features of a scholastic subject may be
more readily tested by the one, other features by the other, each
should be employed in the spheres to which it is best adapted.
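
The gain from combining two imperfect examinations can be illustrated with the standard formula for the correlation between a criterion and the equally weighted sum of two standardized measures. This is a sketch under stated assumptions: the equal weighting and the three coefficients used in the example are hypothetical and are not taken from the studies cited.

```python
from math import sqrt


def composite_validity(r1, r2, r12):
    """Correlation of the sum of two standardized tests with a criterion.

    r1, r2: validity of each test taken separately.
    r12:    correlation between the two tests.
    Assumes both tests are standardized and weighted equally.
    """
    return (r1 + r2) / sqrt(2 + 2 * r12)


# Hypothetical figures: each paper correlates 0.6 with the criterion,
# and the two papers correlate 0.5 with one another.
combined = composite_validity(0.6, 0.6, 0.5)
print(round(combined, 3))
```

With these figures the combined validity exceeds 0.6. The advantage shrinks as the correlation between the two papers approaches 1, that is, as their errors cease to be different.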

CONSTRUCTION OF NEW-TYPE EXAMINATIONS

THE first stage in the construction of a new-type examination should
be the preparation of a detailed statement of what the examination
is intended to measure. A list of the topics which it is to cover should
be written out, and a rough assessment made of their relative
importance, for example on a 1 to 5 scale. Suppose that the sum of all these
assessments comes to 75, and that the examination is to last two
hours. Probably about 150 questions will be needed. In that case
each topic assessed as 1 in importance should eventually be covered
by two questions, each assessed as 5 by ten questions, and so on.
Naturally this distribution of questions need not be enforced
mechanically; but the principle of allotting most questions to the most
significant parts of the field is important. If the test is designed simply
to show the acquaintance of students with a particular textbook, the
questions dealing with each topic may be proportioned to the number
of pages on the topic in the textbook.
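
The allotment just described is simple proportion, and may be sketched as follows. The topic names are placeholders invented for the illustration; only the figures of 75 importance points and 150 questions come from the text. Because each share is rounded, the total may differ slightly from the target when the ratings do not divide evenly, which is in keeping with the advice not to enforce the distribution mechanically.

```python
def allocate_questions(importance, total_questions):
    """Share out questions among topics in proportion to importance.

    importance: mapping of topic name to its rating on the 1 to 5 scale.
    total_questions: number of questions the paper is to contain.
    """
    total_weight = sum(importance.values())
    per_point = total_questions / total_weight
    return {topic: round(rating * per_point)
            for topic, rating in importance.items()}


# Hypothetical syllabus: five topics at each rating from 1 to 5,
# so that the ratings sum to 75 as in the example above.
ratings = {f"topic_{rating}_{i}": rating
           for rating in range(1, 6) for i in range(5)}
plan = allocate_questions(ratings, 150)
print(plan["topic_1_0"], plan["topic_5_0"], sum(plan.values()))
```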

We have already stressed the point that questions should not deal
with trivial details merely because these are the easiest to set. It is
very unwise also for teachers to take sentences straight out of
textbooks and turn them into true-false or completion items. If this is
done, the examinees will assuredly begin to study their textbooks
with an eye to such sentences, and will try to learn them off by rote
rather than to understand them. Textbook sentences may
occasionally be adapted as wrong answers in multiple-choice questions, since
they may then trap the parrot-learner.

When the teacher and the examiner are one and the same person,
suitable questions, or ideas for questions, are likely to occur to him
while he is conducting the work of the class. These should be noted
down and filed; they will save him much time and effort when the
examination has to be made up. No examiner should expect to be
able to devise a complete test at one sitting, unless it is intended to
last only about a quarter of an hour. He will soon discover for
himself his speed of work, and will set aside sufficient hours, on a number
of different days, to enable him to carry out the construction
efficiently. The time needed will, of course, be much reduced if he
can take over some of the questions from previous papers which he has
set in the same subject. This is often possible since the papers are
usually collected at the end of the examination, and not allowed to
circulate freely round the school or university. Before he enters upon
the final stage of construction, he should have in hand roughly double
the number of questions he is likely to need. Many will have to be
rejected for one reason or another, hence he requires plenty of spare
ones.

To reduce the subjective element in setting, it is desirable for
several examiners to co-operate in supplying questions. In an
institution or department where examinations are frequently set to large
numbers, it is useful to build up a pool of questions from which a
number of new-type papers can be drawn as the need arises. Each
question is recorded on a card in a classified file, with details as to
the examinations in which it has been used, and any data regarding
its difficulty and validity. Periodically, questions that have been used
too often are withdrawn, and fresh ones added to keep up with any
alterations in the course of instruction.

The choice of questions for any one paper is governed by quite
a different conception of difficulty from that which prevails in
traditional examinations. In accordance with the principles given in
Chap. II, the test should be so arranged that the poorest examinees
will obtain marks very little above 0%, the best ones very little
below 100%, and so that the average will be as near as possible to
50%. (These figures may then later be converted into some more
conventional scale.) A few questions, therefore, must be so easy
that almost all (say 95%) can manage them, a few so difficult
that scarcely any (say 5%) can manage them. Moreover, all
intermediate grades of difficulty must be adequately represented. By
contrast, all the questions in an ordinary examination are usually
arranged to be roughly equivalent in difficulty. It follows that
in a new-type examination, several sections of the work may be
omitted altogether if they are likely to be known so well that more
than 95% of the students would answer questions on them correctly.
It is a pure waste of time to include items which everyone can do.
Conversely, sections of the work may have to be omitted if even the
best 5% are unlikely to be able to answer questions dealing with
them; such questions would also be useless.

Thus the examiner’s next step must be to grade each of his tentative
questions for difficulty, i.e. to assess what proportions of examinees
are likely to get each one right. The grading may be quite coarse, e.g.
A = 5 to 20%, B = 21 to 40%, C = 41 to 60%, D = 61 to 80%,
E = 81 to 95%. These estimates are likely to be highly inaccurate at
first, and to diverge widely from similar estimates made by other
examiners. But if he takes the trouble to check his predictions after
each paper, and notes how many examinees did actually pass each
item, his capacity will soon improve. The final version of the paper
should contain approximately equal numbers of A, B, C, D, and E
items.¹
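
The coarse grading and the subsequent check of predictions against results can be sketched in a few lines. The grade bands are those given in the text; the function names, the rounding to whole percentages, and the sample figures are assumptions made for the illustration.

```python
# Grade bands from the text: percentage of examinees passing the item.
GRADES = [("A", 5, 20), ("B", 21, 40), ("C", 41, 60), ("D", 61, 80), ("E", 81, 95)]


def difficulty_grade(percent_correct):
    """Return the grade A to E for an item, or None if it falls outside
    the 5 to 95 per cent range and so is too hard or too easy to keep."""
    p = round(percent_correct)
    for grade, low, high in GRADES:
        if low <= p <= high:
            return grade
    return None


def prediction_hit_rate(predicted, observed_percent):
    """Fraction of items whose predicted grade matched the grade observed."""
    hits = sum(1 for item, grade in predicted.items()
               if difficulty_grade(observed_percent[item]) == grade)
    return hits / len(predicted)


# Hypothetical estimates made before the paper, and the percentages
# of examinees who actually passed each item afterwards.
predicted = {"q1": "A", "q2": "C", "q3": "E", "q4": "B"}
observed = {"q1": 15, "q2": 70, "q3": 88, "q4": 33}
print(prediction_hit_rate(predicted, observed))
```

An examiner keeping such a record after each paper could watch the hit rate rise as his estimates improve.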


What particular form of new-type question to adopt depends
entirely upon the subject-matter. Some things are more readily
expressed in one form, some in another. No certain evidence of the
superiority or inferiority of any type is as yet available. Experiments
on this matter have been carried out; but the skill of the authors of
the items has not been controlled. Suppose, for example, that
multiple-choice items are found to be more reliable and valid than
true-false ones; this may merely show that the experimenter is himself
better at compiling multiple-choice items. Nor can we definitely
claim that different types measure distinctive abilities. The
correlations between two multiple-choice tests, or two true-false tests, do
not appear to be higher than those between a multiple-choice and a
true-false test. This indicates that they both measure the same thing.

There may, however, still be some advantage in probing the mind
by a variety of techniques. And several different types may be
employed in a long paper so as to reduce monotony. In a short paper
this is undesirable, since each type needs to be prefaced by full
instructions, and examinees may have to spend too large a proportion
of time in reading the instructions, and in readjusting their ‘mental
set.’ The following rule may be taken as a rough, though not a rigid,
guide. If matching items are included, there should be at least five of
them, each including 5 to 10 responses. If multiple-choice items are
included, there should be at least ten of them. If true-false, simple
recall, or completion items are included, there should be at least
twenty of each. The numbers of other types of questions should be
comparable to these figures.


¹ If, however, the object of the examination is to select or cut off, say, the best
20%, or the poorest 30%, then maximum efficiency will be obtained by aiming
as many questions as possible at these respective levels of difficulty, and not
worrying about their suitability for spreading out the marks of the majority
(cf. p. 31).

When the questions have all been chosen they should be rearranged
so that all those of any one type are grouped together. In each
group the questions should be arranged in estimated order of
difficulty, starting with the easiest. If this precaution is omitted, the
poor and medium examinees may waste a lot of time attempting
difficult items at the start, and fail to reach easier ones which they
could do, and the reliability of their results will be reduced.
We will now consider each of the main types.


SIMPLE RECALL AND OPEN-COMPLETION TYPE


A question or short problem is followed by a blank space where
the answer is to be written. Alternatively certain words or short
phrases are omitted from a sentence or paragraph, leaving blank
spaces to be filled in. For example:¹


1a. Who was the inventor of the steam engine? ........
1b. The inventor of the steam engine was ........


2. fy зе $2 + 3; what 93 Е. 5 


3. Ohm's Law states that ........ × Resistance = ........
4. When a salt is dissolved in water the molecules break up
into ........ A positive electric charge is carried by the ........,
a negative charge by the ........ A balanced action takes place
whereby XY ⇌ ........ + ........ This theory provides an
explanation of the ........ of electricity by salt solutions.

5. A geography test may consist of a map on which numbers,
but no names, are printed. Below is a column of these numbers and
the examinee is instructed to write opposite each the town or river
which it represents. The same technique can be used for testing
knowledge of the names of parts of some instrument, mechanism or
biological specimen.

Each correct answer receives one mark; thus 6 marks may be
scored on Question 4.
The advantages of this type of item are, first, its naturalness and
its similarity to ordinary questioning; secondly, its freedom from the
chance guessing effects which occur in closed or recognition items;
and thirdly, the facility with which it may be constructed. This
facility, however, is somewhat delusive, and unsuspected flaws are
very likely to creep in. The chief disadvantage of the type is that its
scope is practically limited to rote knowledge, or skill at simple
arithmetical and scientific problems.

¹ The author feels that an apology may be needed for the puerility of some of
the illustrative questions in this chapter. They are intended to show the
application of the technique to various school and university subjects, at varying levels
of difficulty. Preferring to construct fresh ones rather than to quote from
published tests, he is limited by the rustiness of his own education.

A number of hints on construction may be listed as follows:

(a) Try to ask only for important items of knowledge.

(b) Ensure as far as possible that there is only one right answer,
and therefore confine the answers to single words or numbers, or to
very short phrases. Some of the blanks in Question 4 violate this
rule. The last one might be filled by ‘conduction,’ ‘carriage,’ or
‘carrying.’ If there is any likelihood of alternatives, the examiner
should list them all beforehand, and decide which ones to allow as
correct. In this instance he would have to decide whether to disallow
‘passage.’ Examiners may find it easier to conform to both these rules
if they think of the answer first, and then build up the question or
sentence around it, in such a way that the answer is completely
delimited.

(c) Keep the questions and sentences as short and as unambiguous
as possible, and in particular avoid over-mutilation of a completion
sentence by inserting too many blanks. A completion sentence should
always be looked on as a form of question. See therefore that sufficient
words are left to make it an intelligible question. For the same
reason, do not put blanks near the beginning of a sentence, which
can only be answered by consulting later clauses of the sentence, or
later sentences in the paragraph. Most of the blanks should be at the
end of the sentences.

(d) Study the whole content of a sentence or question to see that
all of it is essential for the production of a correct response. In
badly constructed items, it may be possible to deduce the response
simply by the exercise of intelligence, without any knowledge of the
subject-matter. In others the correct response may be given away
somewhere in a later sentence; or hints may be derived from the
grammatical construction. For example, blank spaces should not be
preceded by ‘a’ or ‘an.’ An irrelevant clue is provided by the ‘an’ in
the following example:

This difficulty could be overcome by wording the sentence as a
question, or by printing ‘a(n) ........’

(e) Do not vary the length of the blank space according to the
size of the answer expected, as this would also supply irrelevant
clues. See that the spaces are big enough for examinees with large
writing.


(f) For purposes of easy marking, it is desirable to arrange for
answers to be written in the right-hand margin. It is much more
difficult for the eye to pick out answers scattered all over the page.
Even the following is rather inconvenient:

Write down the author of each of these plays.

7. The Admirable Crichton ........

8. Candida ........

One solution is to number the blank spaces and to print a separate
column of numbers opposite which the answers are written, thus:

Write down the authors of each of these plays.

7. The Admirable Crichton                    7. ........
8. Candida                                   8. ........

Marking may then be done with a stencil in which holes are cut,
through which only the answers are visible when the stencil is laid
over the test paper. Under each hole on the stencil is written the
correct answer.

TRUE-FALSE TYPE

This generally consists of a set of statements, approximately half
of which are true, the rest false. The examinee has to indicate which
is which. In a spelling test the ‘statements’ may be reduced to single
words, whose correctness or incorrectness is to be checked. For
example:

Write a + after each of the following words which is correctly
spelled, and a − after each one wrongly spelled.

9. Medecine ........
10. Embarrassment ........
11. Accelerate ........
Й Бек sea - 
13. Advertisement ........

A modification which is useful for tests of spelling, grammar, and
literary knowledge is to instruct the examinee to cross out, or
underline, the one word in each of a set of sentences which is
ungrammatical, or wrongly spelled, or wrongly quoted. One example of
each is given.¹

14. I met him and she in the town.
15. The ruffian demolished all the aparatus in the laboratory.
16. Rule, Britannia! Britannia rules the waves.

¹ These should perhaps be classed as multiple-choice items, since the examinee
has to pick out the incorrect word from about half a dozen correct ones.

The ordinary true-false type has been commonly used in tests of
all subjects, because it is capable of sampling very quickly a wide
range of knowledge, and because examiners find it easy to construct.
But the uncertainties due to guessing (cf. pp. 220-2) and, still more,
its liability to unsuspected errors, have recently made it less popular.
Some recommend that it should only be used when the
multiple-choice type is ruled out for lack of plausible alternatives. Thus the
following are justified since each involves only two possible answers:

17. Distance east or west of Greenwich is
known as longitude.                              True   False
18. When summer time starts the clocks
are put back one hour.                           True   False

There is no need to confine true-false items to rote information.
Indeed, questions of general principle, of interpretation and the
like can well be expressed in this form. Great care is needed,
however, to see that they do not violate some of the following principles
of construction.

(a) Statements should be simply worded, otherwise they are
liable to contain sections which are true and sections which are false.
In general it is best to use sentences which contain only a single
clause, embodying only a single idea; to word them positively; and
to refrain both from negatives and from conjunctions such as
‘because’ and ‘therefore.’ The following examples are ambiguous,
and therefore unsatisfactory.

19. Napoleon lost the battle of Waterloo
because his army had insufficient ammunition.    True   False

20. Beethoven was a great composer who
wrote many symphonies, operas, and sonatas.      True   False

The most crucial part of a statement, which renders it true or false,
should usually come near the end.

(b) It is not easy to construct difficult statements, nor to estimate
their degree of difficulty beforehand. They need to be sufficiently
plausible to deceive the examinee whose knowledge is incomplete,
and therefore not too obviously true or false. And the true ones
should be just as likely to appear false as the false ones true. At the
same time they must be clearly true or false to the able examinee, and
therefore must not contain ‘tricks.’ Good students look out for
essentials, and so may be tripped up if errors are introduced into
minor details. The following example is bad for this reason.

21. Water boils at 212° and freezes at 32° C.    True   False

(c) It has been found that certain forms of wording are much
more apt to be used in true than in false statements, or vice versa.
Examinees soon get to recognize these, and so may answer correctly
without any knowledge at all of the subject-matter. The great majority
of statements, in amateur new-type tests, which contain the words
‘All,’ ‘Always,’ ‘Never,’ ‘Impossible,’ are false. The majority of those
which contain ‘Usually,’ ‘Often,’ and the like, are true. Very
long statements also tend to be true more often than false. There is
no need to discard the above words altogether, but they should be
used with caution and, if possible, introduced as frequently into
true as into false statements.

(d) If the number of statements is less than, say, twenty, there 
should not be an exactly equal division between true and false, else 
examinees will count up to make sure that they have checked them 
‘half and half. * - 

(e) The order in which the statements appear must, of course, be
entirely random. If the examiner never prints more than two true or
two false statements consecutively, examinees may again discover
this and take advantage of it. One way of arranging is to toss a coin.
If it shows heads, take the easiest of the true items as No. 1. If the
next four tosses are tails, take the four easiest false statements as
Nos. 2 to 5; and so on.
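
The coin-tossing procedure just described can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, a pseudo-random number stands in for the coin, both lists are assumed to be already sorted from easiest to hardest, and the handling of the remainder once one kind of item is exhausted is an assumption, since the text does not say what to do then.

```python
import random

def arrange_by_coin_toss(true_items, false_items, rng=None):
    """Interleave true and false statements by simulated coin tosses.

    Both lists are assumed to be ordered from easiest to hardest.
    Returns a list of (kind, statement) pairs, kind being "T" or "F".
    """
    rng = rng or random.Random()
    true_queue, false_queue = list(true_items), list(false_items)
    order = []
    while true_queue and false_queue:
        if rng.random() < 0.5:
            # Heads: take the easiest remaining true statement.
            order.append(("T", true_queue.pop(0)))
        else:
            # Tails: take the easiest remaining false statement.
            order.append(("F", false_queue.pop(0)))
    # Assumption: when one kind runs out, the rest of the other kind follows.
    order.extend(("T", s) for s in true_queue)
    order.extend(("F", s) for s in false_queue)
    return order
```

Because the sequence is left to chance, runs of three or more true (or false) statements will sometimes occur, which is exactly what prevents examinees from exploiting a fixed pattern.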
(f) Various methods for recording the responses have been used:
underlining the words True or False, or Yes or No, drawing a circle
round the letters T or F, or writing the letters T or F. One of the
simplest and most satisfactory is to write a + or a - in a blank
space at the end of each statement. These spaces should all be
aligned in a straight column in the right- or left-hand margin.
(g) The correction for guessing, described on p. 221, should be
applied, and examinees should be warned of this in the general
instructions at the beginning of the test.
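
The correction itself is given on another page and is not reproduced here. As a hedged sketch, the function below assumes it is the conventional one: rights minus wrongs for two-response items, and more generally rights minus wrongs divided by one less than the number of alternatives, with omitted items counting neither way. The function name and parameters are illustrative.

```python
def corrected_score(right, wrong, n_choices=2):
    """Conventional correction for guessing (assumed form): R - W / (k - 1).

    right     -- number of items answered correctly
    wrong     -- number answered incorrectly (omissions are not counted)
    n_choices -- number of alternative responses per item (2 for true-false)
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two alternative responses")
    return right - wrong / (n_choices - 1)
```

For a true-false test this reduces to R - W, so an examinee with 70 right and 30 wrong would receive 40.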


MULTIPLE-CHOICE TYPE, INCLUDING BEST REASON AND
MATCHING ITEMS
A large number of variants of this type may be used,
examples of most of which are quoted below.
(i) The number of alternative responses is usually five, but it may
range from seven to three, or even two, according to the number of
possibilities implicit in the question. Six- or seven-response questions
occupy rather a lot of space. Three- or two-response questions should
usually be avoided because they require the application of guessing
corrections. In any one new-type test it is convenient, though not
really necessary, to provide the same number of responses for all
the questions.
(ii) When the responses consist of single words, they can be
printed all on one line, consecutively. They should be enclosed in
brackets, and preferably be printed in italics. The examinee is then
instructed to underline the one word in each set of brackets which is
correct. For example:

2. They (was, were) going (to, too) mods (too, to) visit (their,
there) aunt, (who, what) lived (there, their).


When, as in most of the remaining examples, the responses consist
of several words or phrases, underlining is troublesome both for the
examinee and for the marker. The responses may sometimes be
printed consecutively and numbered, and the examinee writes the
number of his choice in a blank space. For example:

23. What is the function of the sacral autonomic nervous system?
(1) To increase circulation and respiration. (2) To control posture
and movements of the limbs. (3) To increase the activity of the
reproductive organs. (4) To contract the sphincters of the excretory
organs. (5) To inhibit the activity of the alimentary canal.


This type of question is, however, inconvenient to read, hence it
is more usual, as in the examples below, to print the responses in a
column. The examinee indicates his choice by a cross in one of the
blank spaces in the margin. However, this method tends to
consume a lot of space.

(iii) The items may be stated in question or in completion form.
Thus Question 23 might equally well commence:

The function of the sacral autonomic nervous system is

(iv) In the ‘best reason’ type, the examinee is instructed to check
the best of a set of reasons for a given statement. Occasionally a
more suitable item may consist in finding the worst reason. For
example:

24. Many investigations show that children's intelligence is
somewhat affected by environment and education. Which of the
following provides the poorest evidence for this conclusion?
    The correlation between the intelligence of orphans
        and their parents is lower than that between
        children and their parents who have reared them  ........
    The average difference between the I.Q.s of pairs of
        identical twins is higher among those reared apart
        than among those reared together                 ........
    Children tested before placement in foster homes,
        and again after several years, show a greater
        increase in intelligence in the better than in the
        poorer homes                                     ........
    Children of professional parents are on the average
        markedly superior in intelligence to children of
        unskilled labourers                              ........


(v) Two or three responses, instead of one, may be correct.
Occasionally none of them may be correct. The instructions should
make it absolutely clear to the examinees what is wanted. For
example:


25. Put a cross opposite two of the following English painters
who are especially famous for their landscapes:

    Constable  ...x...        Reynolds  ........
    Hogarth    ........       Romney    ........
    Lawrence   ........       Turner    ...x...


26. Put a cross opposite one answer only.
Sir Thomas Browne is best known as the author of
    Fiction           ........
    Plays             ........
    Poetry            ........
    Works of travel   ........
    None of these     ........


(vi) The same set of responses may apply to several questions. For
example:

Identify the following musical instruments by drawing a circle
round S for strings, W for woodwind, B for brass, and P for
percussion:

    27. Clarinet        S   W   B   P
    28. Triangle        S   W   B   P
    29. Double bass     S   W   B   P
    30. English horn    S   W   B   P
    31. French horn     S   W   B   P


(vii) Matching tests are an extension of the previous category. A
number of questions and a number of responses are listed in different
order, and have to be fitted together. For example:

32. On the left is a list of Treaties or Peaces, on the right a list
of historical developments which were associated with these Treaties.
After each of these developments write the letter of the Treaty to
which it corresponds. Note that some of these letters need not be
used at all, and that some may be used twice.

    Peace or Treaty of:    Historical Development:
    (a) Berlin             Napoleonic empire broken up          ........
    (b) Frankfort          League of Nations established        ........
    (c) Paris              German annexation of Alsace-
    (d) Tilsit                 Lorraine                         ........
    (e) Utrecht            France regains Alsace-Lorraine       ...f...
    (f) Versailles         Independence of Holland finally
    (g) Vienna                 recognized                       ........
    (h) Westphalia         Britain becomes supreme colonial
                               power                            ........
                           Western powers set the stage for
                               Balkan wars                      ........

Many matching tests have equal numbers of items in both columns,
each item corresponding to one, and only one, in the other column.
Since, however, this makes possible the discovery of responses by a
process of elimination, the above plan, with unequal numbers, is
superior. The numbers of items should usually be five to twelve in
each column; a greater number involves waste of time in searching.
When possible the items in at least one column should be in
alphabetical order, to facilitate the search. It may be noted that this
type is more compact than ordinary multiple choice, and so saves
considerable space.
The multiple-choice type of question is the most generally
applicable in educational testing: to diagrams, maps, etc., as well as
to verbal or numerical problems. Not only factual information, but also
quite elaborate reasoning and interpretation can be measured by it.
In the best-reason type, for example, the alternatives provided may
be graded in reasonableness, but only one answer should be
completely correct. Again in matching tests, the examinees can be
required to apply a series of principles to a series of problems. Very
great care, however, is needed in the construction, and the following
hints may be useful.
(a) The multiple-choice type should not be used when simple-recall
questions would be adequate, i.e. when there is only one
definitely right answer to the question. For example, it would be a
waste of time to express arithmetical problems this way:

33. у 3 х 5 =%1, 13, 2. 
Indeed, most mathematical problems are better in open than in
closed form, since the latter may enable the examinee to work
backwards from each answer until he reaches the right question, which
is a very different matter from working forwards from the question.
(b) All the responses provided must be sufficiently plausible to be
selected by a fair proportion of the examinees. Otherwise there is no
object in including them. The incorrect alternatives can often be
made up out of misconceptions and errors which are likely to
be prevalent among the examinees. Sometimes, also, they will be
selected by many of the poorer examinees if they contain several
words or phrases identical with words or phrases in the question or
introductory statement. Both correct and incorrect responses should,
however, be homogeneous in their mode of expression, length, and
other external characteristics. If the correct answer is always longer,
or always shorter, than the incorrect ones, examinees may discover
this and employ it to their advantage.
(c) Each response should be studied for possible irrelevant clues.
For example, each one must be grammatically consistent with the
question. This is especially necessary in matching tests. Some testers
throw together such heterogeneous items that the examinees may be
able to match them without any real knowledge of the subject. The
following is an extreme (hypothetical) instance of a bad item which,
although supposed to show knowledge of geographical terms, could
be answered correctly by inference alone.


34. (a) A geographical line along which atmospheric
        pressures are equal
    (b) Scottish lakes or sea inlets
    (c) Winds which blow at a certain season in the
        Indian Ocean
    (d) A high piece of land from which rivers flow in
        two or more directions
    (e) Lines on a map showing heights above sea-level

    Terms to be matched: Contour lines, Isobar, Lochs, Monsoons,
    Watershed.


‘Contour lines’ must be (e), because ‘lines’ are mentioned in (e).
‘Monsoons’ and ‘lochs’ must be either (b) or (c), since these are the
only remaining plurals. It is unlikely that ‘monsoons’ would be
printed so nearly opposite its corresponding item; moreover ‘lochs’
sounds like ‘lakes,’ hence ‘lochs’ must be (b). (a) and (d) mention
‘geographical line’ and ‘rivers,’ hence ‘isobar’ and ‘watershed’ will
probably fit them.
The examiner should remember that a matching test is an
extended form of multiple-choice item, and should therefore see that
for each item in one column there are at least three items, preferably
four or more items, in the other column which constitute possible
answers.
(d) Not infrequently a better item may be obtained by turning
round a partly formulated question into an answer, and making the
correct answer into a question. For example:

35. Emphasis by means of an understatement is called:
    Meiosis      ........
    Hyperbole    ........
    Euphemism    ........
    Metaphor     ........

This might be re-formulated as follows:

36. Meiosis denotes:
    Emphasis by means of an understatement                    ........
    An exaggerated rhetorical expression                      ........
    Substitution of a pleasant for an offensive expression    ........
    An imaginative simile                                     ........

Probably the second form (No. 36) requires a fuller understanding
of the subject than does the first (No. 35). A third, perhaps still
better form of the question, which would demand the application and
not merely the possession of knowledge, would be a set of phrases
from among which examinees would select the one showing meiosis.
In general the examiner should study each item in an attempt to
anticipate all the mental processes by means of which the correct
answer may be reached. If any of these processes are not
representative of the ability that he wishes to measure, then he should
modify the item.
(e) All statements and answers should be as succinct as possible,
consistent with accuracy. Very verbose or complicated questions
may depend more on the candidate's reading ability and intelligence
than on his knowledge of the subject. (Possibly Qu. 24 offends in
this respect.)
(f) It should hardly be necessary to point out that the position
of the correct answer in the series must be chosen entirely at random.
First and last places should be used as often as any of the
intermediate places.
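
One way of leaving the position of the correct answer to chance is sketched below. It is an illustration only, with a hypothetical function name; the shuffle gives every position, first and last included, the same chance of holding the right response, and the response texts are assumed to be distinct.

```python
import random

def place_responses(correct, distractors, rng=None):
    """Shuffle the correct answer in among the incorrect alternatives.

    Returns the shuffled list of responses and the 1-based position
    at which the correct answer landed (for the scoring key).
    """
    rng = rng or random.Random()
    responses = [correct] + list(distractors)
    rng.shuffle(responses)
    return responses, responses.index(correct) + 1
```

Applied to Question 35, `place_responses("Meiosis", ["Hyperbole", "Euphemism", "Metaphor"])` returns the four responses in a chance order together with the position to enter in the key.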


REARRANGEMENT TYPE
A series of items which normally fall into a certain specific order
is listed in disarranged order, and the examinee has to rearrange
them. For example:*

37. Below is a list of waves or rays. Opposite No. 1 in the
right-hand margin, write the letter (a, or b, or c, etc.) of the wave or
ray with the longest wave-length. Opposite No. 2 write the letter of the
wave or ray with the second longest wave-length, and so on down.

    (a) Light from a sodium vapour lamp                 1. ........
    (b) Waves from a black kettle containing hot
        water                                           2. ...d...
    (c) Gamma rays                                      3. ........
    (d) Television transmission waves                   4. ........
    (e) Rays from the green of the spectrum             5. ........
    (f) Wireless telephone waves                        6. ........
    (g) X-rays                                          7. ........

The mental processes involved in answering this question will
naturally depend on what has been taught. If the pupils have been
provided with such a list, they may answer by rote recall; if not, then
they may have to perform an elaborate process of analysis and
re-synthesis of their knowledge of various branches of physics.
By disarranging the order of phrases in sentences, Ballard (1923)
provides a test which, he claims, measures knowledge of sentence
structure and of the canons of English style. The time sequence of
historical events, the sizes of geographical areas, the order of
operations in some chemical or physical experiment, grammatical
sequences in foreign languages, and the like, can readily be expressed
in this type. It has not, however, been very widely used, probably
because of the difficulty of devising a simple and just method of
scoring (cf. Sims, 1937). Ballard's (1923) plan seems fairly satisfactory.
Award one mark if the first item is correctly identified; thereafter
award one mark for each item which correctly follows the one
preceding it. For instance, if an examinee answers Question 37:

    (f) (d) (a) (e) (g) (b) (c) instead of (f) (d) (b) (a) (e) (g) (c),

(f) is correctly placed first and scores one; (d) correctly follows (f)
and scores one; (a) should not follow (d), but (e) and (g) are in
correct sequence, and so score two; (c) does not score, although it is
correctly placed last, because it should not follow (b). The total
score here is thus 4 marks.

* The reader may wonder why the instructions are not simplified by requiring
examinees to write 1 after the item that should come first, 2 after the next, and
so on. The reason is that the scoring of such a response would be exceedingly
time-consuming, unless the whole of it was correct.
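
Ballard's plan, as described above, is simple enough to state as a short routine. The sketch below is an illustration with a hypothetical function name; it assumes that each letter appears once in the key and that the answer is given as a sequence of letters.

```python
def ballard_score(answer, correct):
    """Score a rearrangement item by Ballard's plan as described above.

    One mark if the first item is right, then one mark for every item
    that immediately follows the item it should follow in the key.
    """
    # For each item, the item that ought to come directly after it.
    successor = {a: b for a, b in zip(correct, correct[1:])}
    score = 0
    if answer and answer[0] == correct[0]:
        score += 1
    for previous, current in zip(answer, answer[1:]):
        if successor.get(previous) == current:
            score += 1
    return score
```

With the worked example from the text, `ballard_score("fdaegbc", "fdbaegc")` gives 4, and a wholly correct answer to a seven-item question gives 7.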
Certain other minor types of question have been described by
Ruch (1929), Rinsland (1938), and others. But these all appear to be
variants of the simple-recall or completion, true-false, or
multiple-choice types.


PRACTICAL EXAMINATIONS


Probably one of the most unreliable of conventional examinations
is the practical in physics, chemistry, and biology and the engineering
trade test; because a 3-, or even a 6-, hour period can provide only
such a limited sampling of the skill or knowledge of the examinees.
Though such examinations cannot be converted to wholly objective,
multiple-choice form, it was found by American institutions for
technical training of recruits during the war that great improvements
were possible along the same lines. The operations at which
examinees are supposed to be competent are analysed into a series of
steps, and some 30 to 50 typical problems are selected, each of which
can be done in a few minutes. For example, in electricity, a certain
circuit is nearly completed, and the examinee has to make one essential
connection which will show his grasp of the circuit without his
having to do the whole wiring. In chemical analysis, he may be told
that such and such reagents have given such and such results; what
should he do next? Several questions in an engineering examination
may be based on capacity to choose the right tools for a given job. In
others, parts of machines may be laid out and the examinee required to
indicate how they should be repaired, and so on. Each of the
fifty-odd questions needs to be supervised by a separate examiner;
senior students can often be employed for this purpose. The
examinees move round from one problem to another in turn, as many
being examined simultaneously as there are problems. Each problem
is usually marked as right or wrong, the right response being
carefully defined beforehand; though occasionally half-marks may be
given if partial competence can be adequately defined.

FINAL STAGES IN THE CONSTRUCTION OF A NEW-TYPE
TEST
The examiner should rescrutinize every item, try to eliminate all
ambiguities, and ensure that the whole of each item is going to
function, i.e. that every word or phrase will be essential in arriving
at a correct response, and that there is no possibility of obtaining
correct responses through irrelevant clues (cf. Hawkes and Lindquist
et al., 1937). This can be done much more effectively if the questions
are submitted to other experienced examiners who may detect
flaws which the author has not noticed.
Next it is essential to make up comprehensive instructions for each
type of question, which will be appropriate to the mental level of the
dullest examinees. Unless the examinees are fairly sophisticated to
new-type testing, sample questions already answered should be
included. In classroom work it is a good plan to provide three samples,
one already answered, to read this through with the class and point
out why the given answer is correct, and then to ask the class to
suggest answers to the other samples, helping any individual
members who still do not understand what is wanted.
Attention should then be directed to ease of scoring. If all the
answers can be aligned in a column in the right-hand margin, a key
can readily be prepared consisting of a sheet for each page of the
test, down the left-hand edge of which are written the correct answers
in the correct positions.
It might be supposed that as some questions are harder than
others, they should be allotted more marks. But actually many
experiments show that weighting is hardly worth the trouble. The
unweighted total scores correlate almost perfectly with weighted
totals. A very simple form of weighting such as the following may be
used: award 1 mark to each correct simple-recall or completion
response, also to each correct response in a matching item; give 1
mark to each true-false or two-response multiple-choice item and
apply the R-W correction; give 1 mark to each three-response
multiple-choice answer but neglect the correction for guessing; give
2 marks to each multiple-choice item where four or more responses
are provided.
Although scoring is so rapid and fool-proof, slips do occur,
particularly in adding up totals. Either then the marker should go
through the papers twice, or else let the students or pupils re-mark
their own papers and ask them to bring any errors to his notice.
Finally the examiner will certainly improve his future examining
if he takes the trouble to study the difficulty and the diagnostic value
of each item. A simple method is to sort the papers into four piles:
the quarter with the highest totals, the quarters above and below the
median, and the quarter with the lowest totals. The numbers of times
each item is correctly answered by members of each group, and by
the four groups combined, may be tabulated. The examiner may
then compare the actual difficulties of the items with his estimates of
difficulty made when setting the paper, and may note whether the
proportions of passes for each item ascend regularly from the lowest
to the highest group. If they fail to do so he should attempt to
discern the flaw in the item which may be responsible for its poor
validity.
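
The four-pile tabulation described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name and the data layout (one list of 0s and 1s per examinee, 1 meaning the item was answered correctly) are assumptions, and papers with tied totals are simply left in the order given.

```python
def item_analysis(papers):
    """Tabulate passes per item in four piles sorted by total score.

    papers -- list of per-examinee lists, one 0/1 entry per item.
    Returns one dict per item with the number of passes in each pile
    (lowest quarter first), the overall total, and whether the
    proportion of passes ascends from the lowest to the highest pile.
    """
    ranked = sorted(papers, key=sum)  # lowest totals first
    n = len(ranked)
    bounds = [0, n // 4, n // 2, (3 * n) // 4, n]
    piles = [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]
    report = []
    for item in range(len(papers[0])):
        counts = [sum(paper[item] for paper in pile) for pile in piles]
        rates = [c / len(pile) if pile else 0.0
                 for c, pile in zip(counts, piles)]
        report.append({
            "item": item + 1,
            "counts": counts,
            "total": sum(counts),
            "ascending": all(a <= b for a, b in zip(rates, rates[1:])),
        })
    return report
```

An item whose "ascending" entry is false is one that the weaker examinees pass as often as, or more often than, the stronger; it is these items that should be re-examined for flaws.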
CHAPTER XIV
IMPROVEMENTS IN EXAMINING
EXAMINATIONS have far too great a burden to bear. They are
supposed to indicate, as shown in Chap. XI, a large number of only
partially related traits and abilities. If we could define more clearly
what it is that we want to measure, and use for each purpose the
most appropriate instrument, there is no doubt that we could effect
a great improvement in the efficiency of these instruments. Let us
consider first some of the substitutes for examinations which have
been recommended by educationists, and the functions which they
may serve.
Oral Examinations, Viva Voces, and Interviews. Oral examinations
are, of course, much older methods of educational measurement
than written examinations. Their main defects were pointed
out when written examinations began to take their place, over one
hundred years ago. It was recognized that the situation to which
successive candidates are subjected cannot be the same for all, so
that no sure basis exists for a comparison between different
candidates. The examiner's tone of voice, the wording of his questions,
his encouragement or criticism, etc., may vary widely from one
candidate to another. If, as in school inspections, all the candidates
are interviewed in the presence of one another, the examiner has to
pose a series of different questions which are unlikely to be closely
comparable in difficulty. In any viva voce the amount of time
devoted to each candidate is usually so short that reliable indications
of his abilities can hardly be expected. There is the further defect that
the situation may cause much greater emotional disturbance than
does a written examination. The interviewer may wish to take certain
qualities of temperament and character into account in his
assessment, but the emotional qualities which are most prominent during
the interview are probably not very typical of the candidate's usual
behaviour, nor of any great significance for educational purposes.
When care is taken to put the candidate at ease, skilful questioning
may, indeed, reveal something of his interests, his attitude to work,
and cultural background, which are educationally relevant. But as
often applied in 11+ selection, the secondary school Head's
interview probably does little more than assess social class and nice
manners. Oral examinations in foreign languages will bring out the
candidate's goodness of pronunciation, but cannot be trusted to
show his facility in conversation, because of the distorting effects of
emotion. For the same reason, viva voces in other subjects cannot
throw much light on the candidate's capacity for formulating his
knowledge orally. Theoretically they should constitute a valuable
supplement to written examinations, since they provide a fresh
medium of expression for ability. But investigations have shown that
judgments based on interviews tend to be still less reliable than those
based on essay-examination answers.

Hartog and Rhodes's (1935) study of the Civil Service interview
is particularly apposite. Sixteen candidates were interviewed in the
ordinary way by two different boards, each composed of four or five
experienced examiners. The members of each board consulted
beforehand as to the questions they would ask (but the two boards
did not inter-communicate), and had before them the candidates'
educational records. During the quarter- to half-hour interviews the
members attempted to assess a candidate's alertness, general
intelligence, and the suitability of his personality for the Civil Service.
After discussion they drew up a final mark. It was found that
members of any one board did agree fairly closely in their judgments, but
that the two boards had reached widely differing conclusions. The
two final marks differed by 12% for the average candidate, in one
case by as much as 31%, in another by as little as 1%. The
correlation between the two sets of marks was only +0.41 ± 0.14. Except
for the small number of the candidates, there seems to be no
technical flaw in this experiment which would allow us to challenge
its depressing results.

There have been many other studies of the interview in the
educational, vocational, and military fields, showing similar
inconsistencies among interviewers (cf. Vernon and Parry, 1949). In
view of the frequent use of the interview by university authorities for
selecting students, Himmelweit and Summerfield's (1951) study is
very relevant. They obtained almost zero correlations between
predictions made by university staff from interviews and subsequent
degree performance, whereas an academic entrance examination gave
small positive correlations and a battery of psychological tests
distinctly better ones (cf. also Vernon, 1953). It would seem that
oral tests and the interview, though essential in education for
purposes of teaching, are useless for purposes of assessing ability or
the results of teaching. In the course of a school year a teacher can
undoubtedly build up a fairly comprehensive and reliable picture of
each of his pupils' abilities, which may be largely derived from oral
work. But the conditions here are very different from those in a
brief viva voce. The volume of questioning is far larger, the emotional
atmosphere less strained, and the oral answers are supplemented by
written work.
Perhaps the best instance of the value of the interview is provided
by the work of vocational psychologists. In advising a child on his
career the psychologist applies several objective tests of aptitudes,
and takes both medical and educational records into account. But
his information as to the child's personality and interests is derived
chiefly from an interview. This is conducted informally so as to put
the child at ease, and though it follows no stereotyped plan of
questioning, it has, in fact, been designed very carefully so as to elicit
the child's main traits. A further function of this interview is that it
enables the psychologist to interpret the test results and other
objective records, and to realize their significance in a synthetic view
of the child as a whole. Guidance which is based merely on the
mechanical application of tests or examinations is not successful.
But there is ample proof that the guidance given by trained
psychologists who adopt these test and interview methods does possess a
high degree of predictive validity. Again, in vocational selection for
skilled trades, the psychologist can often use the interview
successfully for assessing the amount and the relevance of previous trade
experience, though, when conditions permit, he also constructs
and validates a battery of selection tests (cf. ftn. p. 145). For
higher-grade occupations such as managers, officers in the Services, and
teachers, where tests seldom seem to be of any value, the technique
of observing small groups of candidates (as developed by War
Office and Civil Service Selection Boards), combined with a
psychologist's interview, is the most successful (Vernon, 1953).
It would seem, then, that the interview, as at present employed in
educational circles, is practically valueless, but that in skilled hands
it may become a most useful tool. There is no reason other than
expense and the scarcity of competent psychologists why the methods
of vocational selection and guidance should not be applied in
education, both for choosing pupils most likely to benefit from higher
education, and for advising them as to the lines of study for which
they are best fitted. This is already common practice in many
American and some Australian universities. It is occasionally used
here at the 11+ selection stage, though necessarily confined to a
small proportion of borderline candidates.
Intelligence and Special Aptitude Tests. Theoretically, intelligence
tests should be able to fulfil one of the functions commonly assigned
to examinations much better than examinations do at present; that
is the selection of pupils or students with the greatest intellectual
promise, as distinct from achievement. As we have seen, this is not
because they really measure innate ability, but rather because they
show the general level of thinking that the child has developed
outside school, which he should be able to apply in school. They should
also be able to measure objectively the qualities which essay-type
examinations are supposed to elicit, and which new-type tests are
supposed to be incapable of eliciting, such as ‘originality, power of
thinking, grasp of general principles, etc.’ Experimental
investigations indicate that they do have some value for such purposes, but
that they need to be employed with caution. They seem to be most
useful among younger children. The Terman-Merrill, applied at 6 or
7 years, correlates very highly with educational achievement for the
next few years; coefficients of +0.80 or more may be expected. At
the 11+ selection stage, the group intelligence test usually gives
slightly better correlations with subsequent achievement than any
other single instrument. Within a selected grammar school
population, correlations would be expected to drop to about +0.40, hence
there will naturally be many discrepancies between I.Q. and
brightness as observed by the school teachers. But when these teachers
criticize tests they fail to realize how very much worse most of the
pupils with I.Q.s of 110 down to 70 would be. In the modern school,
correlations remain relatively high.
At later ages, the agreement between general intelligence tests
and academic work is usually lower still, because of the still greater
restriction of range. In American universities, with their more
heterogeneous students, coefficients still average around 0.4 to 0.5
(cf. Eysenck, 1947b). But here the figure is usually nearer 0.2 to 0.3
with degree results. Similar correlations are found among Training
College students with their combined theory marks; with practical
teaching marks there is little or no agreement.
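
The effect of restriction of range on a correlation can be illustrated with a small simulation. The sketch below is not drawn from any of the studies cited; the figures (an underlying correlation of 0.8, selection of the top fifth on the test) and all names are assumptions chosen for illustration. It generates test and attainment scores for a whole population, then recomputes the correlation among only those selected on the test.

```python
import math
import random

def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(n=20000, rho=0.8, keep_top=0.2, seed=0):
    """Return (correlation in full group, correlation in selected group)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        test = rng.gauss(0, 1)
        # Attainment shares a correlation of rho with the test score.
        attainment = rho * test + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pairs.append((test, attainment))
    full = pearson(*zip(*pairs))
    # Keep only the top scorers on the test, as a selective school would.
    selected = sorted(pairs)[int(n * (1 - keep_top)):]
    restricted = pearson(*zip(*selected))
    return full, restricted
```

Under these assumed figures the correlation within the selected group comes out markedly lower than in the whole population, although the relation between test and attainment has not changed at all; only the spread of test scores has been curtailed.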


More useful predictions at late adolescent and adult levels are
likely to be reached through testing special aptitudes, or group
factors. Prognostic tests are intermediate between intelligence and
attainment tests. They try to show, not how much mathematics or
English the pupils have previously acquired, but how well they can
apply their intelligence plus acquirements to fresh mathematical or
linguistic problems. Earle (1949) has had some success with English,
science, and mathematics tests, and with his Duplex tests, in
diagnosing success at different types of secondary courses. Many
American colleges give ‘placement’ examinations to incoming freshmen,
consisting of linguistic and numerical or non-verbal intelligence tests,
together with English, mathematics, science, foreign language, and
social studies tests, in order to determine the students' main aptitudes.
There is room for considerable development of such work in our own
schools and universities (cf. Dale, 1954).

Cumulative Record Cards. The primary function of many examinations
is to provide a certificate of work done. Ostensibly this is
true of the General and Scottish Leaving Certificates, and of
university degree examinations, though they are also often interpreted
prognostically. Now, in most schools the teachers know quite well
what work each pupil has done. Hence there is every reason for
making more use of teachers' records of this work and less use of
external examinations. The former cover the whole ground, the latter
never more than a fraction of it. The former are collected under
everyday conditions, the latter when many pupils are in an abnormal
emotional state due to cramming and fear of failure. The former are
also less likely to produce the degrading effects upon teachers' and
pupils' educational values which are characteristic of public
competitive examinations. Teachers' markings of school work are, of
course, just as liable as examiners' marks to subjectivity and
unreliability; perhaps more so, because of the prejudices they are likely
to possess regarding each pupil's character and ability, which may
influence their judgment of his written work. But if in the course of
his career a pupil is taught by several different persons, each of whom
records his achievements according to some uniform system, the
cumulative record should yield a fairly reliable picture. Although,
as stated above, the function of such records should be to certify
work done rather than to predict future promise, yet the investigation
by the Scottish Council for Research in Education (1936) into
the Leaving Certificate showed that teachers' marks did in fact
correlate with subsequent university success just about as well as did the
external examiners' marks. Likewise in McClelland's (1942) and
several subsequent studies of secondary school selection, scaled
primary school teachers' estimates have given better predictions of
grammar school work than have objective attainments tests. The
many other uses to which cumulative records can be put, such as
vocational and educational guidance, have been fully described by
Hamley et al. (1937) and Walker (1955).

The objection which will at once be raised to such schemes is: how
are the standards of different schools to be compared? McClelland
(1942) demonstrated the enormous discrepancies between teachers'
assessments of their pupils and the levels actually obtained by the
same pupils on objective tests, some teachers being far more generous
than others. Indeed unless some system of scaling or standardization
of assessments is adopted, the inclusion of such estimates in a
selection procedure is known to lower its validity. Several possible
solutions to this difficulty have been tried.

Valentine (1938) proposed that at 11+ one or more general
intelligence tests should be given in all schools under one authority.
If, say, one thousand grammar school places were available, and an
I.Q. of 115 was found to cut off the top thousand pupils, then the
number of pupils in each school scoring 115 and over constituted
that school's quota. Meanwhile the head and staff of the school
would have ranked their own pupils in order of suitability for
grammar school education. This rank order and the I.Q. would
finally be combined with appropriate weights. Each school would
send its allotted quota to the grammar school, but the precise pupils
included in the quota would be those whom the primary teachers
had chosen as most able, not merely those with the highest I.Q.s.
While this quota scheme has, in fact, been used in certain urban areas,
it is quite unsuitable for areas where there are numerous small
schools, each submitting a dozen or fewer candidates. Even with big
schools it is rather unreliable since, as shown above (p. 209), a
percentage or proportion has a very large Standard Error.
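The arithmetic of such a quota scheme may be sketched as follows. This is a simplified illustration and not Valentine's own procedure: the pupil records and field names are hypothetical, and the quota is filled from the teachers' rank order alone, whereas the scheme described above combines rank order and I.Q. with agreed weights. The last function gives the ordinary Standard Error of a proportion, which shows why a quota based on a dozen candidates is so unstable.

```python
import math
from collections import defaultdict


def school_quotas(pupils, cutoff_iq):
    """Count, for each school, the pupils scoring at or above the cut-off I.Q."""
    quotas = defaultdict(int)
    for pupil in pupils:
        quotas[pupil["school"]] += int(pupil["iq"] >= cutoff_iq)
    return dict(quotas)


def select_by_quota(pupils, cutoff_iq):
    """Fill each school's quota from the top of its teachers' rank order."""
    chosen = []
    for school, quota in school_quotas(pupils, cutoff_iq).items():
        ranked = sorted(
            (p for p in pupils if p["school"] == school),
            key=lambda p: p["teacher_rank"],  # rank 1 = judged most suitable
        )
        chosen.extend(p["name"] for p in ranked[:quota])
    return chosen


def proportion_standard_error(p, n):
    """Standard Error of a proportion p observed among n candidates."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Hypothetical records for two small schools.
    pupils = [
        {"name": "a", "school": "X", "iq": 120, "teacher_rank": 2},
        {"name": "b", "school": "X", "iq": 110, "teacher_rank": 1},
        {"name": "c", "school": "X", "iq": 100, "teacher_rank": 3},
        {"name": "d", "school": "Y", "iq": 118, "teacher_rank": 1},
        {"name": "e", "school": "Y", "iq": 116, "teacher_rank": 2},
    ]
    print(school_quotas(pupils, 115))
    print(select_by_quota(pupils, 115))
    print(proportion_standard_error(0.25, 12))
```

In the example School X earns one place through pupil a's I.Q., but the place goes to pupil b, whom the teachers ranked first. With a quarter of twelve candidates above the cut-off, the Standard Error of that proportion is 0.125, half as large as the proportion itself.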

An alternative plan, which is receiving experimental trials, is to
fix each school's quota initially at the same number as gained
grammar school places in the previous year (or recent years). A few pupils
who are ranked by their school as just above or below this borderline
are examined, and the quota is adjusted up or down according to
their results. This method is referred to by the Ministry of Education
as selection without any external examination, not even an
intelligence test. We have already pointed out that fairly considerable
variations in standards may occur, by chance, from year to year
(p. 63). Thus it is possible that the method, though admirable in
intention, may break down through the necessary quota adjustments
becoming too numerous and unwieldy.

Many American universities employ what is, essentially, an
elaboration of the same scheme. They record the final secondary school
marks of students coming from various contributory schools for
several years, together with the performance of these students at the
university itself. Eventually it becomes possible to determine the
relative standards of the secondary schools concerned. For example
it may be found that students with 60% from School A are just as
good as those with 80% from School B. Thus the numbers of students
worthy of selection from each school can be deduced.
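One simple way of deducing such equivalences, given here only as a sketch, is to fit for each school a straight line predicting university performance from school marks, and then to find the mark in one school which leads to the same predicted performance as a given mark in another. The straight-line assumption, the function names, and the figures are all illustrative; the text does not say what method the American universities actually use.

```python
from statistics import mean


def fit_line(school_marks, university_marks):
    """Least-squares line predicting university marks from school marks."""
    mx, my = mean(school_marks), mean(university_marks)
    sxx = sum((x - mx) ** 2 for x in school_marks)
    if sxx == 0:
        raise ValueError("school marks show no spread")
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(school_marks, university_marks))
    slope = sxy / sxx
    return slope, my - slope * mx  # (slope, intercept)


def equivalent_mark(mark, from_fit, to_fit):
    """Mark in the second school predicting the same university result."""
    slope_a, intercept_a = from_fit
    slope_b, intercept_b = to_fit
    predicted = intercept_a + slope_a * mark
    return (predicted - intercept_b) / slope_b


if __name__ == "__main__":
    # Hypothetical past records: School B marks far more generously than A.
    school_a = fit_line([50, 60, 70], [55, 65, 75])
    school_b = fit_line([70, 80, 90], [55, 65, 75])
    print(equivalent_mark(60, school_a, school_b))
```

With these invented records a mark of 60% from School A corresponds to 80% from School B, as in the example above. In practice several years' records would be needed before such lines could be trusted.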

Probably the best plan is that advocated by McIntosh et al. (1949), the
National Foundation for Educational Research and others, of scaling
each school's own marks or rankings against some uniform external
set of attainments tests, by an adaptation of the techniques described
above (pp. 54-8). Selection can then be based primarily on the
teachers' judgments, derived from internal school examinations and from
cumulative records of junior school work. But each set of judgments
is calibrated or scaled up or down to accord with the distribution
obtained by that school's candidates on the external, objective tests.
An agreed weighting can be given to the objective tests also. Such a
scheme too requires fairly large numbers per school, though
McIntosh finds it workable even among schools submitting only half a
dozen candidates. It has the disadvantage that schools are still likely
to cram for the external tests, but it does take full account of teachers'
knowledge of their own pupils. A useful discussion of these and other
problems raised by the various selection procedures at 11+ is given
by Dempster (1954).
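The calibration of teachers' estimates against an external test can be sketched in its simplest form, that of giving each school's estimates the same mean and standard deviation as its candidates obtained on the external test. This is an assumption made for illustration: the techniques of pp. 54-8 are not reproduced in this section, and the scaling actually used by McIntosh or the National Foundation may differ. The equal weighting in the second function is likewise hypothetical.

```python
from statistics import mean, pstdev


def scale_to_external(teacher_marks, external_scores):
    """Give one school's teacher marks the mean and S.D. of its test scores.

    The rank order awarded by the teachers is preserved; only the level
    and spread of the marks are altered.
    """
    t_mean, t_sd = mean(teacher_marks), pstdev(teacher_marks)
    e_mean, e_sd = mean(external_scores), pstdev(external_scores)
    if t_sd == 0:
        raise ValueError("teacher marks show no spread and cannot be scaled")
    return [e_mean + (m - t_mean) * e_sd / t_sd for m in teacher_marks]


def combine(scaled_estimates, external_scores, weight_external=0.5):
    """Weighted composite of scaled estimates and external test scores."""
    w = weight_external
    return [(1 - w) * s + w * e
            for s, e in zip(scaled_estimates, external_scores)]


if __name__ == "__main__":
    # A hypothetical generous school: high marks, widely spread.
    teacher = [70, 80, 90]
    external = [95, 100, 105]
    scaled = scale_to_external(teacher, external)
    print(scaled)
    print(combine(scaled, external))
```

The generous marks of 70, 80 and 90 are brought to 95, 100 and 105, so that candidates from different schools can be placed on one list.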

Methods such as these require to be worked out and applied by
a statistically trained psychologist. As described here, they might not
appear to be appropriate as substitutes for the General Certificate
or other higher-level examinations. And yet the accrediting systems
adopted by many Australian and New Zealand universities show
that something similar can be devised, which will largely banish the
strain and unreliability of externally-conducted examinations among
secondary school leavers, and yet provide a reasonable guarantee of
scholastic achievement and of suitability for university courses.
IMPROVEMENTS IN ORDINARY EXAMINATIONS
Since external examinations will for a long time play a large, though
we hope diminishing, part in education, and since even in an ideal
system teachers must apply internal examinations, and mark their
pupils' work, we must consider carefully the improvements which
might be made in the examination as it now is. From the investigations
we have described, and from the discussion in previous chapters,
the following principles may be deduced.

1. The teacher or examiner should formulate as precisely as possible
what it is that he wants to measure. This implies first a clear
conception of the objectives or aims to which, in his view, education
should be directed. At the primary school level he may be content
with the view that education consists in implanting certain traditional
knowledge in children's minds, such as the way to work out
sums and to parse sentences. This may be one objective in the secondary
school and university also, though surely not one of the most
important. History, science, foreign languages, and the like do involve
the acquisition of a certain body of facts (dates, chemical formulae,
meanings of words, etc.); but they are supposed to possess many
additional values. Taste for literature, interest in foreign peoples,
scientific habits of mind, ability to apply knowledge to the
interpretation of present-day events, are but a few of the things which we
hope to inculcate. In making such an analysis, the teacher must be
careful to take into account the facts which psychologists have
established in regard to transfer of training, namely that general
faculties, like observation, memory, and reasoning, will not be
developed merely by the study of some particular school subject, and
that indeed these faculties have little real existence. For example, a
rational and scientific approach to contemporary problems can be
developed by studying typical problems rationally, by conducting
discussions and projects, by showing how irrationalities of thinking
arise, and so on; but not by the carrying out of stock experiments in
physics, nor by memorizing the rules of grammar or logic. A
behavioristic formulation should be helpful: instead of talking in terms
of vague mental qualities, we should ask: what can the educated
pupil or student do in each specific field of study, or in the world at
large, which the uneducated one cannot do; and what are the precise
elements of our teaching which are supposed to confer upon him
Having formulated as candidly as possible our general philosophy
of education, we can ask which of these objectives are to be examined.
We may decide that desirable moral conduct and social adjustment
cannot be expressed in written work. But each objective should be
considered from the point of view: what can the educated pupil
write, or what problems can he solve, which the uneducated one
cannot.

The external examiner, who has not been responsible for teaching
the examinees, should adopt a similar procedure in deciding what his
examination is meant to measure. He needs, however, to be especially
careful to emphasize 'can' rather than 'ought' in his analysis. It is
merely futile for him to complain that teachers are not doing their
work properly, and to insist that examinees must conform to intellectual
standards which may be current among persons like himself, but
which are quite unattainable by children or immature students. His
job is to set tasks which will differentiate the better pupils who exist
in real, not ideal, schools, from poorer ones. If he is supposed to pick
out, say, the best 30%, but formulates objectives which even the best
5% can hardly achieve, then he is not doing his work efficiently and
is creating unnecessary difficulties for himself when the time for
marking arrives.

A group of persons, such as a head master and his staff, or a board
of examiners, should be able to carry out such analyses much more
logically and thoroughly than can any one person, however
clear-headed he may be, since they can criticize one another's ideas.

Finally, the conclusions of this analysis should be communicated
by the examiners to their prospective examinees, or by the teachers
to their pupils. At the present time examining frequently degenerates
into a battle between the examiners who try to catch out their victims,
and examinees who try to find out what their examiners really want,
and what are their personal prejudices, by conning the papers they
have set in the past. All this is a stupid waste of everybody's time.

2. Having decided upon the abilities that he wishes to measure,
the teacher or examiner should consider the form of examination
which will be most appropriate for each of them. French conversation,
qualitative chemical analysis, and other capacities obviously
require special types. But in most instances the important point to be
settled is whether the examination should be new-type, essay-type,
or a mixture. Two pertinent considerations in this connection are the
number of examinees, and the examiner's own skill in constructing
new-type questions. If the numbers are very small, he may be justified
in shirking the extra expenditure of time required by objective
examining. And he should know from past experience that he is more
successful in devising new-type measures of some things than of
others.


Generally speaking, any kind of factual information, or
straightforward skills, such as the carrying out of arithmetical operations or
grammatical analyses, should be examined in new-type form, instead
of being left to the vagaries of subjective marking. This need not
mean that only trivial and unrelated items of knowledge or skill
should be included. Good questions, as we have seen, can bring out
the examinees' ability to interpret the significance of facts, grasp of
general principles, and application of such principles to specific
problems. True, there are many dangers in these more elaborate
items, but they are hardly as serious as the dangers of trying to
discover all the underlying abilities of the examinees from their
answers to essay questions.

We admit, however, that many important aspects of ability may
be displayed more clearly in essay-type answers or (with subjects
such as mathematics and physics) in the working out of long and
complex problems. Literary style of composition, imaginative
qualities and originality, evidence of the examinee's interest in the
subject and of his own thinking about it, are some of the things
which the examiner may wish to elicit, even if he can never hope to
measure them perfectly reliably. If the examination is designed
primarily to bring out qualities such as these, then the examinees
should be told so, and should be informed that amount and accuracy
of information will not, in this instance, be taken into consideration.
Frank directions about what is wanted are particularly desirable
when the same examination includes some new-type and some
essay-type questions.

3. If, for one reason or another, essay-type questions have to be
maids-of-all-work, then the examiner should ask himself which parts
of the total field to be examined are the most important. Knowing
that his questions can sample only a fraction of it, he should avoid
the less essential parts, even at the risk of his questions being easily
'spotted' beforehand. Overlapping among the questions is obviously
undesirable also, since it is wasteful to cover any one part twice over.
The unrepresentativeness of single papers, and the desirability of
improving reliability by increasing their number should be borne in



256 THE MEASUREMENT OF ABILITIES


mind. With each question, the length of time needed for a reasonable
answer should be considered, so that the examinees shall not be set
too much, nor too little, to do in the period allowed.

Much investigation has been made into the kinds of essay questions
which are least unreliable, particularly by the New York
College Entrance Board (cf. Lindquist, 1951). It has been found that
rather short discussion questions, each of which poses a definite
problem and supplies specific indications of what is wanted, are the
best. Such questions still give plenty of scope for the examinee to
reveal his powers of marshalling and organizing his material. The
much more vague, yet very popular, type of question which allows
greater freedom of response to the examinee only leads to confusion
and difficulty in comparing one examinee with another.

4. Perhaps the most important quality in a good examiner is his
ability to foresee the sort of answers he will get, both through his
insight into the minds of pupils or students, and through his experience
of their answers in the past. If in formulating each question he
possesses a clear conception of the probable answers (good,
average, and bad) the following advantages are likely to accrue.
The questions will be simply worded, free from ambiguities, and
they will not be interpreted in such divers ways that the answers are
too heterogeneous to be compared, or marked consistently. The
questions will also be appropriate in difficulty, not, that is to say, the
type of questions which no one but the examiner himself could
answer adequately. Most important of all, the production of answers
to these questions will really bring into play the abilities which the
examination is aiming to measure. When, later on, the examiner
reads and marks the answers, he should continually be trying to
improve this 'faculty,' by noting how far his expectations are
fulfilled. Every question may be regarded as an experiment in forecasting
which, although never likely to be completely successful, may
become progressively more so with practice.

In this, as in all other branches of examining, two opinions are
likely to be better than one; three than two. A board of four or five
will perhaps be the most efficient.

5. When essay-type questions call for answers of a largely factual
nature, a detailed scheme of marking should be drawn up. This is
especially necessary when several different examiners each mark a
certain proportion of the papers. Experiments indicate that it will
reduce their variability by roughly one-half. But a single examiner
will also greatly improve his consistency if he decides beforehand
just what points to look for, and marks these objectively.

In assessing the more general qualities which essay answers are
supposed to bring out, the quality scale method (cf. p. 34) should be
adopted. The examiner should always be on the alert to ensure that
he is judging the scripts solely on the basis of the criteria which he
has previously formulated, that he is not being biased by bad
handwriting or spelling, by some trick of style which jars on him, or by
some personal predilection regarding the subject-matter. All marking
which cannot be done objectively should consist largely in
introspection: "Why does this strike me as good or bad? Are these the
qualities which I am supposed to be judging, and are they the qualities
which I took into account when I selected my standard specimens?"

Many questions may advantageously be marked both by the
qualitative and by the objective (count) systems, the figures then
being combined in due proportion. If two or more examiners mark
the same, or even a few of the same, scripts, it will be particularly
valuable to analyse the reasons for any discrepancies in marks, so as
to clarify still further their criteria of good and bad.

One other source of prejudice often arises in marking if the
examiner is personally acquainted with some, or all, of the examinees.
Either he should carefully avoid looking at the names on the papers,
or, if he cannot help recognizing their writing or style, should do his
best to disregard any previous impressions he may have formed
about them.

6. With the statistical aspects of marking we have already dealt
in Chaps. I-IV. But the following main points in subjective marking
(as distinct from count scoring) may be recalled. Marking is first
and foremost the arranging of scripts in order of merit, or grouping
them into categories of merit. It is not an attempt to give an absolute
judgment of the value of each script. All the answers to one question
should be classified into such categories before starting on another
question. The categories should be, approximately, evenly spaced,
and should not exceed ten in number. The frequency distribution of
marks for each question should tend to the normal shape, and the
average mark should be very close to 5 (assuming the middle category
of merit to have been labelled 5). The dispersion of the distribution
should also be approximately constant for all questions,
unless it is desired to weight some more heavily than others. The
final total scores for the scripts may be converted into an appropriate
distribution of percentage marks by any of the three methods given
in Chap. IV. All decisions as to standards of passing and failing should
be postponed until the process of scaling has been completed.
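An examiner may check his own marks against these recommendations with a few lines of arithmetic. The sketch below is illustrative only: the tolerance allowed on the average mark is an arbitrary assumption, and the marks are invented.

```python
from statistics import mean, pstdev


def check_question(marks, middle=5, max_categories=10, mean_tolerance=0.5):
    """Return a list of warnings about the marks given to one question."""
    warnings = []
    if len(set(marks)) > max_categories:
        warnings.append(f"more than {max_categories} categories in use")
    average = mean(marks)
    if abs(average - middle) > mean_tolerance:
        warnings.append(f"average mark {average:.1f} is not close to {middle}")
    return warnings


def dispersion_report(marks_by_question):
    """Standard deviation of the marks awarded on each question.

    A question with a larger S.D. carries more weight in the total,
    whether or not the examiner intended it to.
    """
    return {q: pstdev(m) for q, m in marks_by_question.items()}


if __name__ == "__main__":
    # Invented marks for six scripts on two questions.
    marks = {"Q1": [3, 4, 5, 5, 6, 7], "Q2": [7, 8, 8, 9, 9, 10]}
    for question, values in marks.items():
        print(question, check_question(values))
    print(dispersion_report(marks))
```

Here the second question has been marked too leniently, its average being 8.5, and would be flagged for reconsideration before the totals are formed.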


The final step in many examinations is to reduce the marks to two,
three, or more categories (Pass or Fail; First, Second, Third Class,
etc.). Here it is most desirable that all scripts which fall within a
few per cent. above or below any important borderline should be
read a second time, preferably by more than one examiner, so as to
ensure that the eventual decision shall attain the utmost reliability
of which fallible human judgment is capable.
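The picking out of borderline scripts is easily made routine. In the following sketch the margin of three per cent. and the borderlines themselves are hypothetical; each examining body would settle its own.

```python
def scripts_to_reread(percent_marks, borderlines, margin=3.0):
    """List the scripts lying within `margin` per cent. of any borderline.

    percent_marks : mapping of script identifier to final percentage mark
    borderlines   : the marks dividing one class from the next
    Returns a mapping of script identifier to the borderlines it is near.
    """
    flagged = {}
    for script, mark in percent_marks.items():
        near = [b for b in borderlines if abs(mark - b) <= margin]
        if near:
            flagged[script] = near
    return flagged


if __name__ == "__main__":
    # Invented final marks and class borderlines.
    marks = {"s1": 49, "s2": 60, "s3": 72, "s4": 38}
    print(scripts_to_reread(marks, borderlines=[40, 50, 70]))
```

Scripts s1, s3 and s4 would be read again; s2, lying well clear of every borderline, would not.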
1. 12.                         28. None
2. 6. $ 29.-А. 
3. 150, 3» 30. B. 
4. 105 (104 or 106 may also 31. 69 or 62. 
22 »count).- и 32. — А ог — 3. 
„57 80. у 33: О à 
6::21; 5 34. р . 
7. 92 (ог 31). 35. В. 
8. 17 (or 15-16). 36. D. 
9. positive (or plus). 37. B e 
. 10. faculties, 38. C. 
11. Spearman, ` 39. A. Be 
12. 2. » 40.„Мопё 
13. specific.» 41. D. 
4147 g. : 222 42. E. (In*Nos. 42:47 опе 
15. group (or common). 43. D. , point is scored by each 
16. К, К: m or F. 44. СЕ. ,answer in which one бг 


17. +. (In Nos. 17-25, Sub- 45. А. both letters are given 
18. +. tract оце mark for wach 46. CA. correctly. .Nothing is 
19. ;-. incorrect answer. No 47, EC.,scored by any answer 
_20. +. gnessig о correction 48, D, which also includes an 
21. m.u need be applied in’sub- 49B, incorrect letter.) 


22. —. sequent questions.) „50. б. «For scoring of Nos. 
23. — 5523 51. А. 48-54, see р. 242.) 
ACE Е DE. 

2% zy » 53: Е 

26. С. = 54. С 
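The two special scoring rules attached to this key may be stated as a short sketch. In Nos. 17-25 one mark is subtracted for each incorrect answer; in Nos. 42-47 one point is scored by an answer giving one or both letters correctly, and nothing by an answer which also includes an incorrect letter. The function names are our own, the treatment of omitted items (no credit and no penalty) is an assumption, and the keys used in the example are illustrative rather than a transcription of the key above.

```python
def rights_minus_wrongs(responses, key):
    """Score plus/minus items: +1 for a right answer, -1 for a wrong one.

    Items missing from `responses` are treated as omitted and score 0
    (an assumption; the key does not say how omissions are handled).
    """
    score = 0
    for item, correct in key.items():
        given = responses.get(item)
        if given is None:
            continue
        score += 1 if given == correct else -1
    return score


def score_letter_item(response, key):
    """Score an item whose key consists of one or two letters.

    One point if every letter given is in the key and at least one
    letter is given; nothing if any incorrect letter is included.
    """
    given, correct = set(response), set(key)
    return int(bool(given) and given <= correct)


if __name__ == "__main__":
    illustrative_key = {17: "+", 18: "+", 19: "-"}
    print(rights_minus_wrongs({17: "+", 18: "-"}, illustrative_key))
    print(score_letter_item("C", "CE"), score_letter_item("CA", "CE"))
```

Thus a candidate who gets one of the plus/minus items right, one wrong, and omits the third scores nothing on the three; and an answer of C alone earns the point where the key is CE, while CA does not.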
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Piaget, J., 156; 26% 7 А | Terman-Merrin test, 21, 2% 23, 68, 72, e 
Picfute Frustration test, 187. ® 144, 159-60, 162-6, 167-9, 179, 
Pidgeon, D. A., intelligence teste 182 4192-3, 249 * 


Porteus, S. D., Maze t, 181, 193, 
PL SO dE Е 

Primary Mental Abtlities (Р.М.А.) 
tesis, 144, 185 . 

Progressive Matrices, 179, 182 

Pullias, Е. У., 127-8, 262 


Raven, J. C., Progressive Matrices test, 
17 2 


9-80, 182, 26 

ction test, 187, 262 
vÉcabulary sales, 174, 179 

Rhodes, E. C., 27, 203-5, 247, 261

Richards, M. K. B., clerical test, 185 

Richardson, C. A., intelligence tests, 

183 

Rinsland, H. D., 243, 263 

osvazweig, S., Picture Егазйацой 


test, 187 . 


Rothney, J. W. M., 68, 260 
Ruch, G. M., 221, 243, 263


Saffir, M., 260, 
Schonell, Е. Е.; 263 
English test, 177 
Schonell, F. 4., 154n., 158, 283 
educational fests, 155, 173-5, 177 
intelligence tests, 183, 263 
personality schedules, 187 
Scottish Council for, Research in 
Education, 707. 159, 209, 250, 263 
Seguin-Goddard Formbo: d, 181 
Seven РЁ з Assessment tests, 195, 178 
Simplex Intelligence test, 183 
Simplex Jffnior tests, 883 
Sims, V..M., 242, 263 
Slater, P., personality tests, 186—7, 259 
Steight, G. F., intelligence test, 482 
Smith, I. M., mathematics test, 177 
spatial test,9184 
ты Arithmetic test, 178 У: 
intelligence test, 182 
Spearman, С., 904, 127, 129, 137, 
141-4, 147, 150, 152, 167, 168, 
263 Ша 
Stanford-Binet tests,72-3, 112-18 152, 
158, 
tests 
Starch, D., 203, 263 я 
Steel, J. H., 21, 263 $^ 
Stephenson, W., intelligence test, 182 
Summerfield, A., 247, 261 
Sutcliffe, A., 186, 263 
Sutton Booklet, 186 ° 
Ѕућопаѕ, P, M., 228, 262 


179; see або Terman-Merrill 
b ° 


Thamet tests, 176, 178, 182 е n 
Thomsop, G. H., 66,123,*1275.* 139, 
es. 141, 143, 145g., 147, 173, 174, 263 
intelligence tests, 183 e 
Thorndike, E. L., 23
Thorndike, R. L., 123, 128, 263
Thouless, Кан, v, 74n., 121n,, 129, 
Thurstone, L. L., 23, 141, 143-4, 147,
269, 264 М ы 
eprimary abilities tests, 185 


+ Tomlinson, T. P., intelligerice tests, 
181, 184 % 
Tulchin, S. H., 164, 264 6256 
К 


' Я 
Valentine, C. W., 208-9, 251, 264
intelligence and reasoning tests, 160, 
180, 186 ве, 
Vernon, Н. M., 187., 264 Т 
Vernon, P. E., 197., 68, 132, 135, 139» 
144n., 143,45, 157, 161, 180, 181, 
186, 187, 193, 202, 247, 24g, 264 
educational tests, 173, 177” 
Study of Values, 187 
Vincent, D. Ғ., mechanical est, 185 
Vineland Social Maturity Scale, 187 


WAIS, see Wechsler
Walker, A. S., 187, 251, 264 
$ Walton, В. D., оше test, 177 
Watts, A. F., edu tional and language 
tests, 174, 176, 264 
performance test, 481 . s 
spatial test? 184 47% 
Wechsler, D., intelligenae tests, 68, 72,« 
153, 157, 160, 161, 162, 164,.168, 
170, 179, 194, 264 
Weiss Long, А. E., 186, 265 
West Riding intelligence test, 184
West Yorkshire intelligence test, 184
Whilde, N. C., 154n., 265
Whipple, G. M., 186, 265
Wilmer, E. N., 186, 265


° 
Wilmut, F. S., 209, 260
Wing, H. D., Music tests, 186, 265
WISC, see Wechsler
Wiseman, S., 205, 265


interest test, 187 


intelligence test, 184 5 к ^ 


„| "Wood, B. P., 203, 265 ‚ * 
Yates, F., 85, 261
Yule, G. U., 107
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ч Аз discrinfinable grades об, 31, 034, 
67-6 
"facipriat Analysis of, 112, Ch., УШ, 


eo) 
оў аа апа ant of, 24-5, 
2-3, 53, 56, 58, 62-5, 67, "150 
measurement by sam} pling, 20, 87, 
127, 131-2, 152-3, 2002, 212, 227; 
" 229, 255 > 
methods of scoring, Ch. IT 
» ‘ature df, 19, 130-2 2 
numerical evaluation of, 18-19, 27, 
28-31, 33-5, 56-65, 149, 203 
qualitative grading of, 11, 26-8, 32, 
® 945, 205, 257 
rate end power plans of scoring, 21, 
-3,1 
units in measurement of, 3, 21-5, 30, 
59-60, 65, 70-1, 194 
Accident proneness? 17-18 
9 Accrediting, 252 
chievement tests, see а 
tests 
Age, allowances, 53, 1951. 
determination of average, Ж 43-4 
factor imcorrelations, 120-2, 128, 
A 131, 134 
growth of abies with, 69-71, 135, 
152, 157n., 209-1 
ranges of tests, 1 о, 162, 171-2, 
188-9 
tabulation of, 5 „° 
American e: ations, 198, 252, 256 
,,new-type, 21 
American tests, 170, 172 
British. performance ой, 67, 75, 81, 
^ 149, 159, 560 
construction of, 189-90 
for secondary schools, 155 
prognostic, 250 * 
purchase of, 170 А 9 
Analysis of variancé, "85-7 ie 
Analytic schemes of qe see 
B meu db 
rithmetic ability, group Sacr y 
131 "5 126, 137, 138, 143 
measurement. of, 19-20, 257132, 135,» 
А 207, 211, 255° 
new-type examinations % 232, 240 
» range of, 63, 65-6 
tests of, 21, 153, 154, 164, 174, 177-8 
Arithmetic Age, 69 
Attainment tests, see Educational tests 4 
Attenuation, 129, 1347., 209 
Australian tests, 170-2
university entrance, 249, 252
Average, see Mean
Average Deviation (A.D.), see Mean
Variation


Bi-factor analysis, 140-1, 147, 152 
Bimodal distribution, see Frequency 
distributions ы 


Character traits, factor analysis of, 141 

tests and ratings of, 43, 19-20, 704, 

126, 144, 148, 154, 186-7, 197-8, 
246-8 


Child guidance, use of tests in, 39, 145, 
147, 154, 161, 164—5, 168- 9, 192, 
193, 208 

Children; effects of examinations ‘on, 

ъ 199, “246, 250 

with language handicaps, testing of, 


young, testing of, 158, 159, 160-1, 
162, 164, 168-9, 191 
шау attitudes, to marks, 29, 33, 


to new-iype examinations, 213 

to tests, 158, 161, 164-5, 169, 191 
Chi- -squared Q у 6m, 88- 91, 104— 7 
БИШ; selection for, 135, 247, 


Clerical abijity tests, 185 , 
»Coapse grouping, effects on correla: 
tions, 98, 101 

on means, 42 

on standard deviations, 1 
Coefficient of Association, 107-8 
Colligation coefficient, 107 
Colour. blindness, tests of, 186 L 
Column diagram, see Histogram 
о of tests, see Marks and 


Сш кашу, 137, 1427. е. 
UNE of abilitiespsee Marks and 


Comptnsatio®’ 252% 130, 133-4 
Concept formatiort, 146, 157 

Gonfidence limits, 78, 80, 83, 110 
ротар coefficient (©), 104-7, 


Оа, applications of, 92, 112, 
118-29, Cris. УП, VIII, 152-3, 
е. 201, 209, 224, 227, 247, 249 
attenuation of, 129, 134л., 209 
average, 424-5 


, Attitude tests, 19, 187 - *” ®, 


ооа 


between categoric variables, 104-8 


et 


° 
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ЖЕ 
• 
Correlation (сот.) 
biserial, 108–9• ; 
combining and centrasting coeffi- 
cients of, 110—1 


2 . 

forecasting efficiency 112, 118-20, 
123, 209 . 

from regression Іше, 115 

from gmall populations, 76, 109, 140 

in homogeneous populations, 122-3; 
128, 139, 201, 205, 249 

influence on, ¥ irrelevant factors, 
120-3, 131 

interpretation of, 118-23 

inverse, seegnegative 

Kefidall’s те од, 104a 

multiple, 124—5, 1457, 

fiature and objects of, 2, 92-4, 112, 
118 P 


negative, 94, 107, 132, 134 е 
artial, 121-2, 131, 136, 139. 
rper. method, 94' 
Probable Errors of, see Reliability 2 
product moment methods (r), 92, 
94-102, 108, 109, 115 
rank order method | (p), 26n., 92, 
103-4, 110, 
reliability of, g.v. 
ré, 106-7 
Spearman feot-rule method, 104 
tetrachoric (rj, 108 ` 
z technjque, 110911 
Correlation Ratio (n), 104-5 
Count scores, 3, 20-5, 28-30, 212, 257 
Critical Ratio (С.Е.)/ 80, 82,85 
Cross-classification of pupiff, 92, 186 
Cumulatite frequency graph, 7), 
Cumulative record cards, 163, 187, 
250-1%52 e. 


Deciles, 37-8, 67 9%; 
ратна of, Freedom, 857., 88n., 9ол., 
06 > 


Develop ental Quotient, 1577. 
Баш tests, 154, 170, 174, 177, „_ 
б 


Dictation, see Spelling » 

Dispersion of measpires, 2, 16, 36, 40, 
‚45, 53-4, 62-3, 65-6, 81, 87, 493-4 
indices of, see Mean Variafion, 
. Rafige, Standagd Deviation 

Distractions if testing, 191, 192-3 

T ability, tests, 135, 136, 1557 

7 .” 


effect of, upon reliability E 
2, 84-5 * 
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Educational bgelofardness, 71, 12%" 


158, 161— 
Educational Suidancé, 3, 317.,¢68, 
144-5, «1540 163, 167, 190, 199-6, + 
, 248-9, 251 А 
Educatignal objectives, 198, 253 
Educational Quotient (Е.03,71, 167-8, 
ucatiorial tests, gl, 31p., 153, 154-5, 
— 17-3, 2Ж' 3» e 


сһоїсё of, 153, 188-9 
constructign of, 153, 189 
correlations between, 134-5 
functigns of, 154-5 
norms, 69—73, 154 
reliability of, 126, 127-8, 1945, 150, « 


М e WX 
* validity of, 158-1, 228, 230 


. 


English ability, 


See also Arithmetic, Reading, etc. 
factorial analysis e € 

136, 139, 142. „ 

tests of, 75, 153, 154, 157, 176-7, 
234, 242, 250 


See also English, compositiort, Read- 
ini . 


5 D 
English composition, marking of, 43, = 


E 


21, 25, 27, 28, 99, 31, 34-5, 201-7, 
211, 256 
range of ábility in, 65-6 
subjects for, 64, 149, 256 
tests of, 135, 155, 175 
training in, 99, 224 xi 
variations in abilftyat, 63, 201-2, 205 
xaminations, age allowances in, 53, 
195n. 


as commercial qualifications, 197-8,
210
as pedagogical aids, 32, 198
coaching for, 199, 208, 210, 212, 223,
250, 252


criticisms of, 148-51, 154, Ch. XI,
223, 226, 250, 252, 264 
ease or difficulty of, 24-5, 33, 63-6, 
230-1, 245, 246, 254, 256 
SENI of, 154. 197-8, 208, 246, 
3 


. 
improvements in, 53-8
marking of, 206-7, 220, 227-8,
256-8; see also Marking of scholastic work
new-type, q.v.
аах candidates, $3, 


220 
of small numbers of candidates, 35,
66, 252, 255
58,
тај, 246~ 


О: 
Educational ability, general factor of, 4
passmarks in, 29, 32, 35, 201, 206,


Educational Age (E.A.), 69, 40-5, 171,
188-9, 194-5
12, 258
reliability of, 2, 128, 200-8, 210, 223,
227-8, 2%, 250, 255-6, 258
Examinations (cont.)
setting questions for, 64, 148, 151, 
253-6 


err for, 246-52 ә 
validity of, 2, 87-8, 151, 208-10,


Examinations, Baccalauréat, 2055
Civil Service, 135, 202, 247
General Сетийсуе, УД 135, 19% 

198-9, 202, 206, 210, 250, 252
Leaving Certificate (Scottish), 198, 
9, 210, 250 

Matriculation, 2, 197, 210 
Qualifying (Scottish), 31, 71
Scholarship, 108, 206, 208-10, 228
School Certificate, 27, 203-4 
Sene (internal), 53-66, 497, 2525 


Setondary School ЕЕ 4.у. 


Special Place, 203, 2


ШУЛ College; 79-30; 58, 135, 


University, 30, 58, 62-6, 148,
197-8, 201, 202, 203-4, 209,


= 


F-ratio, 87 
Factor patterns, 138, 144, 146,
147, 152 


effect of heterogeneity on, 139
effect of tuition on, 128, 144 

Factorial analysis, Ch. VIII 
limitations of, 138-40, 146-7 
techniques of, 136-7, 139, 143, 147
uses of, 20, 130, 135, 138, 144-6,
152, 168-70, 224
Factors, common, 120, 137, 141
is P D › 136, 137n., 139, 140-4, 


group, 131, 136-45, 147, 151, 152,
161, 169, 195, 226, 250

multiple, 140-1, 143-4, 146, 147, 152 

aes 127%, 140-2, 144,145, 150, 


йош, 20, 130-138, 143, 145,224, 


Formal discipline, 199, 253
Free word association tests, 39
Biens "en of, 178 T 
Frequency distribution: symmetrical
6n., 15, 17, 31, 104
bimodal, 15 
COPIE and сзпрайш 53-62, 


curves, area of, 11, 16 
definition of, 4
dissimilarities in, 54, 59
graphical representation of, 7-11,
Frequency distributions (cont.)
Ша ш, 18) 24, 29—30, 35, 
1 
normal, see Normal distribution
of classified measures, 5-7
of correlations, 76
of differences, 76-80
of means, 17, 75, 82-3
rectangular, 26254 
_ Skew, 15, 17-18, 23-4, 5 
*..39, 54, 57 
statistical methods for handling,
Ch. III, 54-8
tabulation of, 5-7
Frequency polygon, 7-11, 13-16,
Function fluctuation, 128-9, 137n., 200,
209-10, 327


g factor, 142-6, 152, 161, 168, 169 
Gaussian distribution, see Normal 
distribution 
Goodness of fit, 50-1, 90-1 
Group intelligence tests, application
of, 68, 162-3, 164, 167, 190-2,
249-50


batteries, 142, 156 

choice of, 153, 188-9 

classified list of, 153. 181-4 

compared with ‘individual, 162-6, 
167-70 

construction of, 152-3, 155-6, 188-9

differential, 169-70

distributions of scores on, 10-16, 32, 
50-1, 81, 83-4, 90-1, 188-9 

effects of health and mood on, 151, 
162, 164-5, 169, 191

effects of practice and coaching on,
157n., 165-7, 191, 195

scoring of, 156, 164, 189-90, 191-2

time limits, 122, 156, 163-4, 190

validity of, 151-3, 167-9, 249

See also Mental tests


39, 31-2, 


Handedness tests, 186 
Handwriting ability, distribution of, 17,
factorial analysis of, 136
marking of, 25, 29, 44
tests of, 39, 135, 155, 175
Heterogeneity of Populations, 122-3,
128, 135, 139, 201, 204n., 205, 209,
Histogram, 7-11, 13
Honours students, superior ability of,
16, 81-2, 209


Intelligence, age and, 69-71, 94, 249 
definition and nature of, 19, 130, 142, 
146, 152, 156-7, 166-9
economic status and, 8, 70, 74-5,
121, 158


Intelligence (cont.)
educational attainments and, 69, %,
108, 122, 125, 139, 144, 146, 153,
156-8, 164, 168-9, 1%, 2%
249-50
hereditary and environmental influences
on, 68, 156-8, 167, 249
language handicaps and, 158
nationality and, 68, 15;
physique and, 134
sex and, 15-16, 76-80, 157n.
subjective judgments of, 139, 148-50, 
151-2, 196 
vocation and, 144-5, 146, 157, 195 
Intelligence Quotients (I.Q.), distribution
of, 45-19, 49-50, 52, 72-3, 195
interpretation of, 19-22, 36, 71-3,
193-6
prediction of, 112-18
stability of, 126-7, 156-7, 165, 168, 195
МАШЫ nton of, 52, 71-3, 
157n., 195
Intelligence tests, see Group, Non-verbal,
Terman-Merrill, etc., tests
Interests, factorial analysis of, 141, 147 
intellectual, 134, 139, 141, 144, 146,
157, 246
tests, 187
Internal consistency, 152-3
International Examinations Enquiry,
203, 205-6
Interviews, 246-8


K or k:m factor, 144, 161


Language development tests, 174, 176

Letter grades, 11, 26-8, 20.

Linguistic and literary abilities, see
English, French, Reading, etc.


Machine-scoring of tests, 189-90
Manual abilities, factorial analysis of,
132, 134, 136, 137, 142, 144
tests of, 132, 135, 186
Marking of scholastic work, analytic,
28, 203-4, 207, 211-12, 223, 25

by ranking, , 27, 28, 30, 32, ЭЎ 
59, 65, 104 2572 . 

improvements in, 31-5, 58, 59-62,
256-8

objective vs. subjective, 11-12, 0f0-1,
28-30, 32-3, 35, 65-6, 149, 155, 202-3,
206-7, 211-12, 226-8, 255, 25%

standards of, 27-8, 33-5, 53, 58,
63-5, 86, 150, 200, 202-3, 205,
251-2


teachers’ idiosyncrasies in, 6, 27, 29,
33, 62-4, 202, 257
See also Ability, Arithmetic, English
composition, Examinations

Marks and scores, combination of, 22,
26, 28, 53-0, 59, 123-5, 204
comparisons of, 26, 32, 53-4, 56-8,
59-60, 69
distributions of, 4-7, 11, 24, 26,
32-5, 48, 51-8, 59-62, 86,
scaling of, 33, 35, 56-62, 124, 207
з тра 33, ҒЫН у 
significance of,'24-5°27-30, 3% 36, 
33-40, Ch. IV, 150, 210 9 
weighting of, 54-5, 66, 123, 125, 244, 
Mathematics tests, 177, 185, 250
Mean, arbitrary or guessed, 42-5
dex, 85, 95 shea hee ү 
Mean, arithmetic, calculation of, 1, 6n.
v 38.9 40 9 48 SUCEDE 
effect of, in combining marks, 64-6 
S.E. of, 82-3
Mean Variation (M.V.), 45, 48
Measures, definition of, 3
Mechanical ability (m-factor), 130-1,
144, 145
tests of, 184-6


^ 

Median, 36-9, 45, 58

Memory, 130, 132, 142, 143, 145, 168,
193, 253

Mental Age (M.A.), 23, 69, 70-1, 156,
159, 160, 161, 162, 171-2, 188-9,
194-5

Mental deterioration, 145-6, 170

Mental measurement, 19-20, 194; see
also Ability 

contrasted with physical, 19, 21-4

Mental organisation, 130, 132, 138,
143, 146-7, 2425

Mental speed, 130, 132, 144, 146,
163-4

Mental tester, desirable qualities in,
| 


‹ Г 
training needed, 73, 153, 159, 160,
162, 164, 190, 193, 196 
Mental tests, classified list of, 153-4, 
170-87 
сове рсНоп off lo, 23, 143, 148-53. 
further developments of, 169— , 250 
instructions, 148, 149, 159, 189-90
inter-relations of, 92, 94, 129, 130-5;
see also Factorial analysis


limitations of, 67-73, 126-9, 149,
151, 15%, 159-60, 161-70, Ch. X,
249-50
norms, 25, 26, 28, 40, 55, 58, 66-73,
150, 160, 161, 169, 171-2, 189,
190, 191-2, 195
origins of, 148


ostic,M 519250 
ersef, 171
Mental tests (cont.)
reliability of, #7, AZln., 126-9, 131,
137, 150, 44, 165, 168-9, 188,
194-5
scoring of, 149-50, 159-60, 164,
189-90, 191-2
standardisation of, 25, 67-8, 69-70,
148, 59, 153, 159-60, 19:
validity of, 2, 87-8, 118-19, 123-5,
129, 150-3, 16 ay 188, 249-50
Multiple factor analysis, 140-1, 143-4,


Musical Ss fiers ier vod of, 


tests of, 186
New-type examinations, advantages
over essay-type, 52, 211-13,
223-4
applicability of, 20-1, 35, 66, 219-20,
228, 239, 243, 25
best-reason questions, 237-8, 239
completion questions, 214, 228, 231, 


272-4, 244 
defects of, 151, 211, 219-28, 232-3,
-6, 240-1


" distributions of marks on, 29-30, 
32-3, 212, 220-1,227, 230-1 

elerte istic tendency dh, 223-5, 
226, 228, 235, 255 

guessing factor in, 220-2, 232, 235,
236, 237, 244, 258

irrelevant clues in, 225-6, 233-4,


Ф 


236, 240-1, 244 

matching questions, 218-19, 231, 
238-42; 244 

multiple-choice questions, 215-17,


D 


220, 222, 266 Io 2А 231, 234n., 235, 


eos 243 + 

principles of construction, 32-3, 153,
212-13, #2; 224-6, 227-8, Ch.
XIII

rearrangement questions, 222 m 

reliability of, 210, 221-2, 

HAE of, 220,7 4, 242-3, 245 


212.213-14, 


Lu 


е recall qu^ А ions, 
ЕУ 226, 231 232-4, 239-49, 244 | 


*o 


т 


specimen paper, 213-19, 258

suggestive effect of false statements

true-false questions, 214-15, 220-2,
228, 231, 234-6, 244

sa of; BL 22, 228, 245 

weighting of marks,

Non-verbal group tests, classified
list of, 153, 181-2
factor analysis of, 144
uses of, 158, 163, 250
See also Performance tests

Normal curve of error, see Normal
distribution

Normal distribution, algebraic equation
of, 4

application to reliability, 76-80

application to test construction, 23-4,
31, 65, 1 2

applications to school marking,
17-18, 24, 26, 31-5, 55, 59-62, 205,
257

definition of, 17 

disturbing factors in, 17-18 

examples of, 17 

ordinates of, 108

testing conformity to, 50-1, 90-1

Norms, see Mental tests

Null hypothesis, 88, 89, 105

Numerical description of distributions,
36-7, 39-40, 49, 56
Numerical evaluation of abilities, see
Ability


Ogive graph, 7n.
Omnibus tests, 156, 190
One-tail and two-tail significance tests,
79n.
Oral group tests, 163, 181


Y 


Passmarks, see Examinations 

Pearson-Bravais method, see Correlation

Percentage marking, defects of, 28-31;
see also Ability, numerical evaluation
of

Percentages, S.E. of, 84, 86, 90

Percentile graphs, 56-61

Percentiles, 36-7, 38-40, 49, 58, 156,
171, 189, 194

comparison of distributions by,
56-61, 65
sizes of units, 59, 67

Performance tests, application of, 136,
158, 150-2, 169, 180-1, 189, 199
classified list of, 153, 180-1
construction of, 148-30
factor analysis of, 144
improvement on, with age, 74
limitations of, 126, 161
scoring of, 21, 149, 155, 161, 189

Perseveration tests, 104

Personality, study of, 141, 147, 161,
169, 193-4, 246, 248; see also
Character, Temperament
tests, 154, 186-7
Physical abilities, distribution of, 17
measurement of, 23
tests of, 132, 186

Political opinions, distribution of,
scaling of, 19
Practical ability, 134, 136, ен
161, 168, 1®3
factor (F), 1447-9161,
tests of, E Pe
Practical examinations in science, 24.
Practical Мүм ability, 136, 146,
249
Practice, effects of, on intelligence
tests, 127, 157n., 165-7, 191, 195
Practice sheets, 156, 189, 191
Probability integral tables and graph,
49-52, 59, 60-2, 77, 79
Probability theory, 51-2, 63-4, 77-80,
Probable Error (P.E.), 78-80, 87n.,
109, 204n.; see also Reliability
Product-moment method, see
Correlation
Product scales, see Quality scales
Projective techniques, 187
Pupils’ records, see Cumulative record
cards


Quality scales, 34, 149n., 154-5, 176,
178, 254
Quartiles (Q1, Q3), 36-8, 40, 3 4, 56,
Quota scheme, 251


Range of measures, 1, 6, 36, 40, 45, 48,
51, 55, 66; see also Dispersion
Ranking of measures, 4, 103, 195;
see also Marking
Rank-order method, see Correlation
Rapport in testing, 16$, 1943
Rate and power plans, see Ability
Reading ability, factorial analysis of,
134-6
marking of, 29, 56,
tests of, 21, 22, 74, 135, 151, 153,
173-5
Reading Age, 69
Reasoning abilities, 143, 146, 157-8,
223-5, 2,
tests of, 17
Recognition type items, 144, 167, 169,
212, 222, 28555232, 224-43
Regression, 1549, 123, 133
equations, 115-16
multiple, 125
non-linear, 104-5
Reliability, coefficient of, 120/7129
effects of function fluctuation, 128-9
effects of heterogeneity, 128
effects of length of test, 126, 127-8
formulae for, 81-5, 109-10, 117,
126-8
of correlations, 74, 76, 84, 81,
109-10,
Reliability (cont.)
of differences, 2, 74-5, 76-82, 84-5,
86
of means, 74-5, 81, 82-3, 84
of percentages, 84, 86, 90
of predicted scores, 89-8, 117-19,
123, 196-7, 157
of ers. 87, "I 194
of small sample, 85
of standard deviations, 83-4
of z, 110
theory of, 74-80
See also Examinations, Mental tests


Sample test items, 156, 189, 191, 264
Sampling the population, unselected,
7, 67-70, 74-5, 80, 92-3
Sampling errors, 63, Ch. V, 105,
109-10, 126-7, 128-9, 137n., 138n.,
142, 195, 20051
Sampling of ability, see Ability
Scaling, see Marks and scores
Scatter diagram, 101, 104, 105,
117
Scatter of measures, see Dispersion
Scholastic abilities, factorial analysis
of, 92, 134-6, 142, 144
tests of, see Educational, Arithmetic,
etc., tests
Scientific approach to psychological
problems, 3, 74, 76, 78, 86, 130
Scotland, test norms in, 68
Secondary education, selection for,
31, 71, 92, 119-20, 122, 124-5,
154, 163, 164, 165-9, 195n., 197,
198, 199, 205, 209, 223, 224, 228
Semi-interquartile range (Q), 40, 45,
79.
Sensory-motor tests, 159, 186
Sex differences in ability, 3, 15-16,
76-81, 83-4, 157n.
in cinema attendance, 89-90
in milk-drinking, 84
Simple structure, 141
Skew distribution, see Frequency
distributions
Small sample techniques, 85
Spatial judgment (#-factor), 143, 144,
tests of,
Spearman-Brown prophecy formula,
127, 2
Speed Ретов ity, 130, 132, 144,
146, 163-4, 168;
Spelling ability, factorial analysis of,
134-6
measurement of, 22, 25, 26
tests of, 21, 135, 175, 234
Spread of measures, see Dispersion
Standard Deviation (S.D. or σ), 36, 40,
» 87%: 
З сазе жмем, 83 
о  olaveraged measures, 12: 
E of, vin ыы тзв, 54-6, 
Standard Error (S.E.), 78, 87n., 8%;
see also Reliability
Standardisation, see Mental tests
Standard scores, 55-6, 59-62, 65, 67,
69, 73, 115, 123, 156, 171, 194
Standards of marking, see Marking of
scholastic work
Statistical methods, definition of, 1, 2-3
limitations of, 2-3, 80, BE, 146-7
objects of, 1 24
Statistical significance, 79-80; see also
ріл и 
52 85, 87, 110
Tabulated measures, average of, 40-5
correlation between, 98-102
percentiles of, 38
standard deviation of, 47-8
Tabulation of measures, 1, 3-7
Television viewing, 405-7
Temperament tests, 19, 126, 161, 186-7
У, iod deren of, 141, 144, 146 
Tetrad difference criterion, 142
T n IS 2 а. 156, 
Training College students, abilities of,
39-40, 75-6, 80, 86, 135, 136, 139, 
ела» = 
Trustworthiness of statistics,
see Reliability
Two-factor theory, 141-5
WES factor patterns, 141-2, 147,


United States, УРА 

University education, selection of
students for, 2, 122, 135, 197, 198,
208-10, 228, 247, 248-5, 252


Validity, see Examinations, Mental tests
Variability, see Dispersion, Reliability
of attainments, 1-2, 17, 62-3, 129,
131, 135, 200-2
Variables, categoric, 88-90, 104-7, 108
definition of, 3
discrete and continuous, 3, 5, 50
traits zm bp. 19 
Variance, 8 
Verbal ability (v or v:ed factor), 143,
144, 145, 168, 193; see also
English ability
Verbal tests, see Group intelligence
tests
Viva voce, see Examinations, oral
Vocabulary, see Language development,
Stanford-Binet, etc.
Харра abilities, factor anarysis of, 
Vocational guidance, 144-5, 154, 163,


Vocational selection, 119-20, 125, 135, 
145n., 161, 210, 247-8 

Vocational tests, 55, 118-19, 125,
151, 154, 184-6, 159


War Office Selection Boards, 348
Weighting, see Marks and scores 


X-factor, ¥ и ? 
ге ап Диев, 110-11 


